First, do NOHARM: towards clinically safe large language models

David Wu, Fateme Nateghi Haredasht, Saloni Kumar Maharaj, Priyank Jain, Jessica Tran, Matthew Gwiazdon, Arjun Rustagi, Jenelle Jindal, Jacob M. Koshy, Vinay Kadiyala, Anup Agarwal, Bassman Tappuni, Brianna French, Sirus Jesudasen, Christopher V. Cosgriff, Rebanta Chakraborty, Jillian Caldwell, Susan Ziolkowski, David J. Iberri, Robert Diep, Rahul S. Dalal, Kira L. Newman, Kristin Galetta, J. Carl Pallais, Nancy Wei, Kathleen M. Buchheit, David I. Hong, Vartan Pahalyants, Ernest Y. Lee, Allen Shih, Tamara B. Kaplan, Vishnu Ravi, Sarita Khemani, Thomas A. Buckley, April S. Liang, Daniel Shirvani, Advait Patil, Nicholas Marshall, Kanav Chopra, Joel Koh, Adi Badhwar, Anastasia Perez, Austin J. Schoeffler, Mahbuba Tusty, Chase M. Walton, Liam G. McCoy, David J. H. Wu, Yingjie Weng, Sumant Ranji, Kevin Schulman, Nigam H. Shah, Jason Hom, Arnold Milstein, Arjun K. Manrai, Adam Rodman, Jonathan H. Chen, Ethan Goh

Large language models (LLMs) are routinely used by physicians and patients for medical advice, yet their clinical safety profiles remain poorly characterized. We present NOHARM (Numerous Options Harm Assessment for Risk in Medicine), a 1,100-task benchmark of primary care-to-specialist consultation cases to measure the frequency and severity of harm from LLM-generated medical recommendations. NOHARM covers 10 specialties, with 12,747 expert annotations for 4,249 clinical management options. Across 28 LLMs, recommendations carried the potential for severe harm in up to 22.6% of cases, with errors of omission accounting for more than 80% of severe errors. In a randomized trial of 101 generalist physicians, human benchmark performance significantly improved with AI assistance, yet physicians remained far from realizing the potential of AI tools, frequently ignoring essential advice surfaced by AI. Safety performance tracked general-intelligence and medical-knowledge benchmarks across the full range of models but decoupled at the frontier. Despite strong performance on existing evaluations, widely used AI models can produce medical advice with the potential for severe harm at non-trivial rates, highlighting the importance of explicit measurement of clinical safety.
