Measuring Epistemic Resilience of LLMs Under Misleading Medical Context

Hongjian Zhou, Xinyu Zou, Jinge Wu, Sean Wu, Junchi Yu, Bradley Max Segal, Tobias Erich Niebuhr, Sara Amro, Michael Petrus, Sheikh Momin, Alexandra M. Cardoso Pinto, Rachel Niesen, Laura Sophie Wegner, Dhruv Darji, Jung Moses Koo, Joshua Fieggen, Kapil Narain, Mingde Zeng, Lei Clifton, Linda Shapiro, Fenglin Liu, David A. Clifton

Large language models (LLMs) now reach expert-level scores on medical licensing exams, which encourages the assumption that high scores imply safe medical judgment, even as patients increasingly turn to these models for health advice. We show that this assumption is fragile: when misleading context is injected into questions that LLMs originally answer correctly, they often abandon the correct answer. We call the ability to maintain correct judgment under adversarial context epistemic resilience, and introduce MedMisBench to measure it. MedMisBench contains 10,932 medical question items and 48,889 misleading context-option pairs spanning medical reasoning, agentic capability, and patient-journey evaluation. Across 11 model configurations, mean accuracy falls from 71.1% on the original questions to 38.0% under focused misleading context, with a 51.5% attack success rate. The most damaging injections are formal, rule-like fabrications: authority-framed falsehoods reach 69.5% attack success and exception-poisoning claims reach 64.1%. A 14-member clinical panel from 7 countries identified serious potential harm in 38.2% of reviewed cases. MedMisBench exposes a structural blind spot in the evaluation of LLMs in medical settings: existing benchmarks measure what models know, but not whether they preserve correct medical judgment under misleading context.
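The two headline metrics, accuracy before and after injection and attack success, can be illustrated with a minimal sketch. The abstract does not give the exact definition of attack success, so the code assumes it is the share of originally correct answers that become incorrect once misleading context is added; the paper may define it differently (for example, by requiring that the model pick the specific misleading option). The `Trial` record, its field names, and the toy data are hypothetical and are not taken from MedMisBench.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One question answered twice by the same model (hypothetical record)."""
    gold: str      # label of the correct option
    original: str  # model answer without misleading context
    attacked: str  # model answer with misleading context injected


def accuracy(trials, field):
    """Fraction of trials whose answer in `field` matches the gold label."""
    return sum(getattr(t, field) == t.gold for t in trials) / len(trials)


def attack_success_rate(trials):
    """ASSUMED definition: among originally correct answers,
    the fraction that are no longer correct under misleading context."""
    eligible = [t for t in trials if t.original == t.gold]
    if not eligible:
        return 0.0
    flipped = sum(t.attacked != t.gold for t in eligible)
    return flipped / len(eligible)


# Toy data for illustration only; not drawn from the benchmark.
trials = [
    Trial(gold="A", original="A", attacked="B"),  # flipped by the injection
    Trial(gold="C", original="C", attacked="C"),  # resilient
    Trial(gold="B", original="B", attacked="D"),  # flipped by the injection
    Trial(gold="D", original="A", attacked="A"),  # wrong from the start
]

print(f"original accuracy: {accuracy(trials, 'original'):.2f}")
print(f"attacked accuracy: {accuracy(trials, 'attacked'):.2f}")
print(f"attack success:    {attack_success_rate(trials):.2f}")
```

Conditioning on originally correct answers keeps the metric focused on resilience rather than knowledge: a question the model never answered correctly cannot show whether it holds its judgment under pressure.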

