In a recent study published in JAMA Network Open, a team of researchers from Vanderbilt University Medical Center explored the potential of ChatGPT (Chat Generative Pre-trained Transformer) to deliver accurate medical information to both patients and healthcare professionals.
Study: Accuracy and Reliability of Chatbot Responses to Physician Questions.
Exploring the Potential of ChatGPT in Healthcare
ChatGPT, a large language model (LLM), has gained significant popularity due to its ability to process and generate human-like responses. By learning from vast amounts of data from across the internet, including articles, books, and other web sources, ChatGPT can understand and respond to user inquiries. This AI-powered chatbot has the potential to revolutionize the way medical professionals access information and improve healthcare efficiency.
By using ChatGPT, physicians can quickly draw insights from medical data and work through complex clinical decisions without having to search multiple references for the information they need. Patients, in turn, can access medical information without relying solely on their doctors.
However, the accuracy and reliability of the medical information ChatGPT provides remain open questions. The chatbot has, at times, generated convincing yet incorrect responses, raising concerns about its dependability.
“Our study provides insights into model performance in addressing medical questions developed by physicians from a diverse range of specialties; these questions are inherently subjective, open-ended, and reflect the challenges and ambiguities that physicians and, in turn, patients encounter clinically.”
Study Design and Methodology
The study involved 33 physicians, all faculty members or recent graduates of Vanderbilt University Medical Center, who created a set of 180 questions spanning 17 pediatric, surgical, and medical specialties. Two additional question sets covered melanoma and immunotherapy, and common medical conditions. In total, 284 questions were selected for assessment.
The questions were designed to have clear answers based on medical guidelines available in early 2021, which coincided with the training data cutoff for ChatGPT version 3.5. The questions varied in difficulty and were classified as easy, medium, or hard.
An investigator entered each question into the chatbot, and the physician who wrote the question assessed the response. The accuracy and completeness of the answers were scored on Likert scales. Accuracy scores ranged from 1 to 6, where 1 indicated a completely incorrect response and 6 a completely correct one. Completeness scores ranged from 1 to 3, with 3 representing the most comprehensive answer. Questions with completely incorrect answers were not assessed for completeness.
The study reported both median [interquartile range (IQR)] and mean [standard deviation (SD)] scores. Statistical tests, such as Mann-Whitney U tests, Kruskal-Wallis tests, and Wilcoxon signed-rank tests, were conducted to analyze the differences between groups. Interrater agreement was also evaluated when multiple physicians scored a specific question.
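For readers curious about what this kind of analysis involves, the short Python sketch below shows how Likert-style accuracy and completeness scores could be summarized (median, IQR, mean, SD) and compared with the nonparametric tests the paper names. All numbers, group labels, and variable names here are invented purely for illustration; this is not the study's data or analysis code.

import numpy as np
from scipy import stats

# Hypothetical accuracy scores (1 = completely incorrect, 6 = completely correct)
easy_scores = np.array([6, 5, 6, 4, 6, 5, 3, 6, 5, 6])
medium_scores = np.array([6, 5, 4, 5, 6, 3, 5, 6, 4, 5])
hard_scores = np.array([5, 4, 6, 2, 5, 3, 6, 4, 5, 2])

# Summary statistics of the kind reported in the paper: median (IQR) and mean (SD)
median = np.median(easy_scores)
q1, q3 = np.percentile(easy_scores, [25, 75])
print(f"easy: median {median}, IQR {q1}-{q3}, "
      f"mean {easy_scores.mean():.1f} (SD {easy_scores.std(ddof=1):.1f})")

# Mann-Whitney U test: compare two independent groups (e.g., easy vs. hard questions)
u_stat, u_p = stats.mannwhitneyu(easy_scores, hard_scores, alternative="two-sided")

# Kruskal-Wallis test: extension to three or more groups (easy / medium / hard)
h_stat, h_p = stats.kruskal(easy_scores, medium_scores, hard_scores)

# Wilcoxon signed-rank test: paired scores, e.g., the same questions scored
# on a first and a second submission to the chatbot
first_pass = np.array([1, 2, 2, 1, 2, 1, 2, 2])
second_pass = np.array([4, 5, 3, 3, 6, 4, 5, 5])
w_stat, w_p = stats.wilcoxon(first_pass, second_pass)

# Correlation between accuracy and completeness scores (Spearman suits ordinal data)
completeness = np.array([3, 3, 3, 2, 3, 2, 1, 3, 2, 3])
rho, rho_p = stats.spearmanr(easy_scores, completeness)

print(f"Mann-Whitney p={u_p:.3f}, Kruskal-Wallis p={h_p:.3f}, "
      f"Wilcoxon p={w_p:.3f}, Spearman rho={rho:.2f}")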
Questions answered incorrectly were reevaluated one to three weeks later to determine the reproducibility of the results. Additionally, the performance of ChatGPT version 4 was assessed by rescoring the immunotherapy- and melanoma-related questions.
Key Findings
In terms of accuracy, ChatGPT had a median score of 5 (IQR: 1-6) for the initial set of 180 multispecialty questions, indicating that the majority of answers were “nearly all correct.” However, the mean score was lower, at 4.4 (SD: 1.7). For completeness, the median score was 3, representing comprehensive answers, but the mean score was 2.4 (SD: 0.7). Thirty-six answers were classified as inaccurate, scoring 2 or less.
There was a moderate correlation (coefficient of 0.4) between completeness and accuracy scores for the initial set of questions. The study found no significant differences in ChatGPT’s completeness and accuracy across difficulty levels, descriptive versus binary questions, or different medical specialties.
During the reproducibility analysis, 34 out of the 36 questions were rescored. The chatbot’s performance notably improved, with 26 questions receiving higher accuracy scores, 7 remaining consistent, and only 1 showing a decrease. The median accuracy score increased from 2 to 4.
The immunotherapy and melanoma-related questions were evaluated twice. In the first round, ChatGPT achieved a median accuracy score of 6 (IQR: 5-6) and a mean score of 5.2 (SD: 1.3). Performance improved in the second round, with the mean accuracy score increasing to 5.7 (SD: 0.8). Completeness scores also demonstrated improvement, and the chatbot performed well on questions related to common medical conditions.
“This study indicates that 3 months into its existence, ChatGPT shows promise in providing accurate and comprehensive medical information. However, it still falls short of complete reliability.”
Conclusion: Potential and Room for Improvement
Overall, ChatGPT demonstrated commendable performance in terms of completeness and accuracy. However, the mean score was notably lower than the median, indicating that a small number of highly inaccurate responses pulled down the average. The convincing, authoritative delivery of these responses makes it difficult to distinguish correct information from such “hallucinations.”
The study highlights the importance of continuous updates and algorithm refinement, as well as incorporating user feedback to enhance factual accuracy and reliance on verified sources. Expanding and diversifying training datasets, specifically within medical sources, will enable ChatGPT to grasp the nuances of medical concepts and terminologies more effectively.
It is worth noting that the chatbot currently does not differentiate between high-quality sources like PubMed-indexed journal articles and medical guidelines and low-quality sources such as social media content. Equal weighting of these sources poses a limitation. However, with time and improvement, ChatGPT has the potential to become a valuable tool for medical professionals and patients alike.
Editor Notes
ChatGPT’s capabilities in generating medical information present exciting possibilities for the future of healthcare. While the study sheds light on its current performance and areas for improvement, it highlights the promising potential of AI-powered chatbots in delivering reliable and accurate healthcare information. As technology continues to advance, we can expect further enhancements and refinements in AI-driven medical assistance. To stay updated on the latest advancements in AI and other tech-related news, visit GPT News Room.
from GPT News Room https://ift.tt/AN0eqs6