Evaluation format, not model capability, drives triage failure in the assessment of consumer health AI

Ramaswamy et al. reported in Nature Medicine that ChatGPT Health under-triages 51.6% of emergencies, concluding that consumer-facing AI triage poses safety risks. However, their evaluation used an exam-style protocol -- forced A/B/C/D output, knowledge suppression, and suppression of clarifying questions -- that differs fundamentally from how consumers use health chatbots. We tested five frontier LLMs (GPT-5.2, Claude Sonnet 4.6, Claude Opus 4.6, Gemini 3 Flash, Gemini 3.1 Pro) on a 17-scenario partial replication bank under constrained (exam-style, 1,275 trials) and naturalistic (patient-style messages, 850 trials) conditions, with targeted ablations and prompt-faithful checks using the authors' released prompts. Naturalistic interaction improved triage accuracy by 6.4 percentage points ($p = 0.015$). Diabetic ketoacidosis was correctly triaged in 100% of trials across all models and conditions. Asthma triage improved from 48% to 80%. The forced A/B/C/D format was the dominant failure mechanism: three models scored 0--24% with forced choice but 100% with free text (all $p < 10^{-8}$), consistently recommending emergency care in their own words while the forced-choice format registered under-triage. Prompt-faithful checks on the authors' exact released prompts confirmed the scaffold produces model-dependent, case-dependent results. Our results suggest that the headline under-triage rate is highly contingent on evaluation format and may not generalize as a stable estimate of deployed triage behavior. Valid evaluation of consumer health AI requires testing under conditions that reflect actual use.
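To illustrate the kind of per-model comparison behind the forced-choice versus free-text contrast, the sketch below computes a two-sided Fisher exact test on a 2x2 table of correct and incorrect triage decisions. The abstract does not state which test produced the reported $p$-values, so the choice of Fisher's exact test is an assumption, and the counts used (0 of 25 correct with forced choice, 25 of 25 correct with free text) are hypothetical placeholders, not data from the study.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are conditions (e.g. forced choice, free text); columns are
    outcomes (correct, incorrect). All margins are treated as fixed.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x "correct" outcomes in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)

    # Sum the probabilities of every table at least as extreme as the
    # observed one (probability <= observed, with a small tolerance).
    total = 0.0
    for x in range(lo, hi + 1):
        p = prob(x)
        if p <= p_obs * (1 + 1e-9):
            total += p
    return total


# Hypothetical counts for one model (NOT taken from the paper):
# forced choice: 0 correct, 25 incorrect; free text: 25 correct, 0 incorrect.
p_value = fisher_exact_two_sided(0, 25, 25, 0)
print(f"two-sided p = {p_value:.3g}")
```

With complete separation between conditions, even 25 trials per arm give a $p$-value far below $10^{-8}$. This shows that contrasts of the magnitude reported are statistically detectable at modest sample sizes, though the actual trial counts per model are not given in the abstract.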

