Although large language models (LLMs) have been assessed for general medical knowledge using medical licensing exams, their ability to support clinical decision-making tasks, such as selecting and using medical calculators, remains uncertain. Here, we evaluate the capability of both medical trainees and LLMs to recommend medical calculators in response to multiple-choice clinical scenarios spanning risk stratification, prognosis, and disease diagnosis. We assessed eight LLMs, including open-source, proprietary, and domain-specific models, on 1,009 question-answer pairs across 35 clinical calculators, and measured human performance on a subset of 100 questions. While the highest-performing LLM, GPT-4o, achieved an answer accuracy of 74.3% (CI: 71.5-76.9%), human annotators, on average, outperformed LLMs with an accuracy of 79.5% (CI: 73.5-85.0%). Error analysis shows that even the highest-performing LLMs continue to make mistakes in comprehension (56.6% of errors) and calculator knowledge (8.1%); our findings thus emphasize that humans continue to surpass LLMs on complex clinical tasks such as calculator recommendation.