Although large language models (LLMs) have been assessed for general medical knowledge using medical licensing exams, their ability to support clinical decision-making tasks, such as selecting and using medical calculators, remains uncertain. Here, we evaluate the capability of both medical trainees and LLMs to recommend medical calculators in response to multiple-choice clinical scenarios spanning risk stratification, prognosis, and disease diagnosis. We assessed eight LLMs, including open-source, proprietary, and domain-specific models, on 1,009 question-answer pairs across 35 clinical calculators, and measured human performance on a subset of 100 questions. While the highest-performing LLM, GPT-4o, achieved an answer accuracy of 74.3% (CI: 71.5-76.9%), human annotators, on average, outperformed LLMs with an accuracy of 79.5% (CI: 73.5-85.0%). Error analysis shows that even the highest-performing LLMs continue to make mistakes in comprehension (56.6% of errors) and calculator knowledge (8.1%); our findings thus emphasize that humans continue to surpass LLMs on complex clinical tasks such as calculator recommendation.