Background: Artificial intelligence language models have shown promise in various applications, including assisting with clinical decision-making, as demonstrated by the strong performance of large language models on medical licensure exams. However, their ability to solve complex, open-ended cases, which may be representative of clinical practice, remains unexplored. Methods: In this study, the accuracy of the large language models GPT4 and GPT3.5 in diagnosing complex clinical cases was investigated using published Case Records of the Massachusetts General Hospital. A total of 50 cases published from January 1, 2022 to April 16, 2022 that required a diagnosis and a diagnostic test were identified. For each case, the models were given a prompt requesting the top three specific diagnoses and associated diagnostic tests, followed by the case text, labs, and figure legends. Model outputs were assessed against the final clinical diagnosis and on whether the model-predicted test would lead to the correct diagnosis. Results: GPT4 and GPT3.5 provided the correct diagnosis in 26% and 22% of cases in one attempt, and in 46% and 42% within three attempts, respectively. GPT4 and GPT3.5 provided a correct essential diagnostic test in 28% and 24% of cases in one attempt, and in 44% and 50% within three attempts, respectively. No significant differences were found between the two models, and multiple trials with identical prompts using the GPT3.5 model provided similar results. Conclusions: These models demonstrate potential usefulness in generating differential diagnoses but remain limited in their ability to provide a single unifying diagnosis in complex, open-ended cases. Future research should focus on evaluating model performance in larger datasets of open-ended clinical challenges and on exploring potential human-AI collaboration strategies to enhance clinical decision-making.
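To put the reported percentages in context, the sketch below converts them back to case counts out of 50 and attaches a rough uncertainty estimate. It is illustrative only: the abstract does not say which statistical test was used, so the Wilson score interval and the two-sided Fisher's exact test are assumptions, and all function and variable names are hypothetical. The Fisher test here treats the two models as independent samples even though both saw the same 50 cases; a paired test such as McNemar's would need the per-case results, which the abstract does not give.

```python
from math import comb, sqrt

N_CASES = 50  # number of cases reported in the abstract

# Reported accuracies in percent, as (GPT4, GPT3.5)
REPORTED = {
    "diagnosis, one attempt": (26, 22),
    "diagnosis, three attempts": (46, 42),
    "test, one attempt": (28, 24),
    "test, three attempts": (44, 50),
}


def to_count(percent, n=N_CASES):
    """Convert a reported percentage back to a number of cases."""
    return round(percent * n / 100)


def wilson_interval(k, n, z=1.96):
    """Approximate 95% Wilson score interval for k successes out of n."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def fisher_exact_two_sided(k1, n1, k2, n2):
    """Two-sided Fisher's exact test p-value for two independent proportions.

    Assumption: the samples are treated as unpaired, which the study design
    (same cases for both models) does not strictly satisfy.
    """
    total_k, total_n = k1 + k2, n1 + n2

    def prob(x):
        # Hypergeometric probability of x successes in the first group
        return comb(n1, x) * comb(n2, total_k - x) / comb(total_n, total_k)

    p_obs = prob(k1)
    lo, hi = max(0, total_k - n2), min(n1, total_k)
    # Sum the probabilities of all tables no more likely than the observed one
    tail = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, tail)


summary = {}
for label, (gpt4_pct, gpt35_pct) in REPORTED.items():
    k4, k35 = to_count(gpt4_pct), to_count(gpt35_pct)
    summary[label] = {
        "gpt4_count": k4,
        "gpt35_count": k35,
        "gpt4_ci": wilson_interval(k4, N_CASES),
        "gpt35_ci": wilson_interval(k35, N_CASES),
        "p_value": fisher_exact_two_sided(k4, N_CASES, k35, N_CASES),
    }

for label, row in summary.items():
    print(label, row)
```

With only 50 cases, the intervals are wide and overlap heavily across models, which is consistent with the abstract's statement that no significant differences were found.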