Towards Accurate Differential Diagnosis with Large Language Models

Daniel McDuff, Mike Schaekermann, Tao Tu, Anil Palepu, Amy Wang, Jake Garrison, Karan Singhal, Yash Sharma, Shekoofeh Azizi, Kavita Kulkarni, Le Hou, Yong Cheng, Yun Liu, S Sara Mahdavi, Sushant Prakash, Anupam Pathak, Christopher Semturs, Shwetak Patel, Dale R Webster, Ewa Dominowska, Juraj Gottweis, Joelle Barral, Katherine Chou, Greg S Corrado, Yossi Matias, Jake Sunshine, Alan Karthikesalingam, Vivek Natarajan

An accurate differential diagnosis (DDx) is a cornerstone of medical care, often reached through an iterative process of interpretation that combines clinical history, physical examination, investigations and procedures. Interactive interfaces powered by Large Language Models (LLMs) present new opportunities to both assist and automate aspects of this process. In this study, we introduce an LLM optimized for diagnostic reasoning and evaluate its ability to generate a DDx alone or as an aid to clinicians. Twenty clinicians evaluated 302 challenging, real-world medical cases sourced from New England Journal of Medicine (NEJM) case reports. Each case report was read by two clinicians, who were randomized to one of two assistive conditions: assistance from search engines and standard medical resources, or LLM assistance in addition to these tools. All clinicians provided a baseline, unassisted DDx before using their assigned assistive tools. Our LLM for DDx exhibited standalone performance that exceeded that of unassisted clinicians (top-10 accuracy 59.1% vs 33.6%, p = 0.04). Comparing the two assisted study arms, the DDx quality score was higher for clinicians assisted by our LLM (top-10 accuracy 51.7%) than for clinicians without its assistance (36.1%; McNemar's test: 45.7, p < 0.01) and for clinicians with search (44.4%; McNemar's test: 4.75, p = 0.03). Further, clinicians assisted by our LLM arrived at more comprehensive differential lists than those without its assistance. Our study suggests that our LLM for DDx has potential to improve clinicians' diagnostic reasoning and accuracy in challenging cases, meriting further real-world evaluation for its ability to empower physicians and widen patients' access to specialist-level expertise.
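The two quantities reported above, top-10 accuracy and McNemar's test on paired outcomes, can be illustrated with a minimal sketch. This is not the study's analysis code. The function names `top_k_accuracy` and `mcnemar_test` are hypothetical, matching a diagnosis by exact string equality is a simplifying assumption (the abstract does not describe how a match was judged), and the abstract does not say whether a continuity correction was applied, so it is left as an option.

```python
import math


def top_k_accuracy(ranked_lists, final_diagnoses, k=10):
    """Fraction of cases whose final diagnosis appears in the first k DDx entries.

    Assumption: a hit is an exact string match; a real evaluation would
    need a clinical judgment of equivalence.
    """
    if len(ranked_lists) != len(final_diagnoses):
        raise ValueError("one ranked list is required per case")
    hits = sum(
        1 for ddx, truth in zip(ranked_lists, final_diagnoses) if truth in ddx[:k]
    )
    return hits / len(final_diagnoses)


def mcnemar_test(b, c, continuity=True):
    """McNemar's chi-squared test for paired binary outcomes.

    b: cases correct under condition A only
    c: cases correct under condition B only
    Returns (statistic, p_value) with one degree of freedom.
    """
    if b + c == 0:
        raise ValueError("no discordant pairs; the test is undefined")
    diff = abs(b - c)
    if continuity:
        # Edwards' continuity correction, optional here
        diff = max(diff - 1, 0)
    statistic = diff ** 2 / (b + c)
    # Survival function of chi-squared with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


if __name__ == "__main__":
    # Made-up counts for illustration only; these are not the study's data.
    stat, p = mcnemar_test(b=10, c=25)
    print(f"McNemar statistic = {stat:.2f}, p = {p:.3f}")
```

Only the discordant pairs (cases where exactly one of the two conditions contained the correct diagnosis) enter the McNemar statistic, which is why it suits a design in which the same cases are assessed under both conditions.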
