Can LLMs Accurately Score Medical Diagnoses and Clinical Reasoning?

Amy Rouillard, Sitwala Mundia, Linda Camara, Ziyaad Dangor, Michael Cameron Gramanie, Ismail Kalla, Shabir A. Madhi, Kajal Morar, Marlvin T. Ncube, Haroon Saloojee, Bruce A. Bassett

Evaluating medical AI systems using expert clinician panels is costly and slow, motivating the use of large language models (LLMs) as alternative adjudicators. Here, we evaluate an LLM Jury, composed of three frontier AI models, for scoring 3334 diagnoses on 300 real-world low- and middle-income country (LMIC) hospital cases. Both LLM- and clinician-generated diagnoses are scored against expert panel diagnoses across four dimensions: diagnosis, differential diagnosis, clinical reasoning, and negative treatment risk. The LLM Jury scores are compared with expert and independent re-scoring panel scores to assess error metrics, inter-rater agreement, severe-risk errors, and the effect of post hoc calibration using isotonic regression. In our data, we find that: (i) the uncalibrated LLM Jury scores preserve ordinal agreement with the expert clinician panel scores, but are systematically lower; (ii) the probability of severe-risk errors is lower for the LLM Jury than for the human expert re-score panels; (iii) the LLM Jury combined with LLM diagnoses can be used to identify diagnoses at high risk of error, enabling targeted expert review and improved panel efficiency; (iv) the calibrated LLM Jury scores and rankings of diagnosing agents show excellent agreement with those of the primary expert panels; (v) LLM Jury models show no self-preference bias: they did not score diagnoses generated by their own underlying model, or by models from the same vendor, more (or less) favourably than those generated by other models. Together, these results provide evidence that a calibrated LLM Jury is a trustworthy and reliable proxy for expert clinician evaluation in medical AI benchmarking. Confirming these findings in other clinical settings is an important direction for future work.
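To illustrate the post hoc calibration step, the following is a minimal sketch of isotonic regression fitted with the pool-adjacent-violators algorithm. It maps raw jury scores onto the expert scale with a non-decreasing step function, so ordinal agreement is preserved while a systematic offset is corrected. The function names `fit_isotonic` and `calibrate`, the toy scores, and the 1 to 5 scale are assumptions made for illustration; they are not the study's data or implementation, and a library implementation may interpolate between steps rather than use a pure step function.

```python
from bisect import bisect_left
from collections import defaultdict


def fit_isotonic(jury_scores, expert_scores):
    """Fit a non-decreasing step function (pool-adjacent-violators).

    Returns (thresholds, values): values[i] is the calibrated score for any
    jury score up to and including thresholds[i].
    """
    # Average expert scores that share the same jury score.
    grouped = defaultdict(list)
    for jury, expert in zip(jury_scores, expert_scores):
        grouped[jury].append(expert)

    # Each block is [sum of expert scores, count, largest jury score covered].
    blocks = []
    for jury in sorted(grouped):
        experts = grouped[jury]
        blocks.append([sum(experts), len(experts), jury])
        # Merge backwards while adjacent block means violate monotonicity.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper

    thresholds = [block[2] for block in blocks]
    values = [block[0] / block[1] for block in blocks]
    return thresholds, values


def calibrate(score, thresholds, values):
    """Map one raw jury score to its calibrated value, clipping at the ends."""
    index = bisect_left(thresholds, score)
    return values[min(index, len(values) - 1)]


# Hypothetical toy data, NOT taken from the study: raw jury scores paired
# with expert panel scores for the same diagnoses.
toy_jury = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_expert = [2.0, 4.0, 3.0, 5.0, 5.0]

toy_thresholds, toy_values = fit_isotonic(toy_jury, toy_expert)
toy_calibrated = [calibrate(s, toy_thresholds, toy_values) for s in toy_jury]
```

In this toy example the expert scores for jury scores 2.0 and 3.0 are out of order, so the fit pools them into a single step at their mean. In practice the mapping would be fitted on one subset of cases and applied to held-out cases, so that the calibrated agreement is not measured on the data used for fitting.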
