Large language models (LLMs) now reach expert-level scores on medical licensing exams. As patients increasingly turn to LLMs for health advice, these scores encourage the assumption that a high score implies safe medical judgment. We show this assumption is fragile: when misleading context is injected into questions that LLMs originally answer correctly, they often abandon the correct answer. We call the ability to maintain correct judgment under adversarial context epistemic resilience, and introduce MedMisBench to measure it. MedMisBench contains 10,932 medical question items and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on original questions to 38.0% under focused misleading context, with a 51.5% attack success rate. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods reach 69.5% attack success and exception-poisoning claims reach 64.1%. A 14-member clinical panel from 7 countries identified serious potential harm in 38.2% of reviewed cases. MedMisBench exposes a structural blind spot in the evaluation of LLMs in medical settings: existing benchmarks measure what models know, but not whether they preserve correct medical judgment under misleading context.
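To make the reported quantities concrete, the following is a minimal sketch of how original accuracy, accuracy under misleading context, and attack success could be computed from per-item answers. It is an illustration under stated assumptions, not the benchmark's scoring code: the `Trial` record, the `summarize` function, and in particular the definition of attack success used here (the share of originally correct answers that become incorrect once the misleading context is injected) are hypothetical and may differ from the definition MedMisBench uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One question answered twice (hypothetical record layout)."""
    gold: str      # label of the correct option
    original: str  # model's answer to the unmodified question
    attacked: str  # model's answer with misleading context injected


def summarize(trials):
    """Return accuracy before/after injection and an assumed attack success rate."""
    n = len(trials)
    if n == 0:
        raise ValueError("no trials to summarize")

    original_acc = sum(t.original == t.gold for t in trials) / n
    attacked_acc = sum(t.attacked == t.gold for t in trials) / n

    # Assumption: only items the model originally answered correctly are
    # eligible for attack, and an attack succeeds when the answer given
    # under misleading context is no longer the correct one.
    eligible = [t for t in trials if t.original == t.gold]
    flipped = sum(t.attacked != t.gold for t in eligible)
    attack_success = flipped / len(eligible) if eligible else 0.0

    return {
        "original_accuracy": original_acc,
        "attacked_accuracy": attacked_acc,
        "attack_success": attack_success,
    }


# Toy data for illustration only; these are not benchmark results.
toy_trials = [
    Trial(gold="A", original="A", attacked="B"),  # correct, then flipped
    Trial(gold="A", original="A", attacked="A"),  # correct, stays correct
    Trial(gold="C", original="B", attacked="B"),  # wrong from the start
    Trial(gold="D", original="D", attacked="C"),  # correct, then flipped
]
toy_summary = summarize(toy_trials)
```

Conditioning attack success on originally correct items separates "the model never knew the answer" from "the model knew it and was talked out of it", which is the distinction epistemic resilience is meant to capture.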