Despite achieving high accuracy on medical benchmarks, LLMs exhibit the Einstellung Effect in clinical diagnosis: they rely on statistical shortcuts rather than patient-specific evidence, which causes misdiagnosis in atypical cases. Existing benchmarks fail to detect this critical failure mode. We introduce MedEinst, a counterfactual benchmark with 5,383 paired clinical cases across 49 diseases. Each pair contains a control case and a "trap" case in which the discriminative evidence is altered so that the correct diagnosis flips. We measure susceptibility via the Bias Trap Rate, the probability of misdiagnosing a trap case despite correctly diagnosing its control. An extensive evaluation of 17 LLMs shows that frontier models achieve high baseline accuracy but suffer severe bias trap rates. To address this, we propose ECR-Agent, which aligns LLM reasoning with Evidence-Based Medicine standards via two components: (1) Dynamic Causal Inference (DCI), which performs structured reasoning through dual-pathway perception, dynamic causal graph reasoning across three levels (association, intervention, counterfactual), and an evidence audit for the final diagnosis; and (2) Critic-Driven Graph and Memory Evolution (CGME), which iteratively refines the system by storing validated reasoning paths in an exemplar base and consolidating disease-specific knowledge into evolving illness graphs. Source code will be released.
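The Bias Trap Rate described above can be illustrated with a minimal sketch. It assumes the metric is the conditional fraction of pairs whose trap case is misdiagnosed among pairs whose control case is diagnosed correctly; the abstract does not give the exact formula, so this reading is an assumption. The names `PairResult` and `bias_trap_rate` are hypothetical and are not taken from the MedEinst code.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PairResult:
    """Outcome of one control/trap pair for a single model (hypothetical structure)."""
    control_correct: bool  # model diagnosed the control case correctly
    trap_correct: bool     # model diagnosed the trap case correctly


def bias_trap_rate(results: Iterable[PairResult]) -> Optional[float]:
    """Fraction of pairs with a misdiagnosed trap among pairs with a correct control.

    Assumed conditional reading of the metric; returns None when no control
    case was diagnosed correctly, since the rate is then undefined.
    """
    eligible = [r for r in results if r.control_correct]
    if not eligible:
        return None
    trapped = sum(1 for r in eligible if not r.trap_correct)
    return trapped / len(eligible)


# Toy data, invented for illustration only (not benchmark results).
results = [
    PairResult(control_correct=True, trap_correct=False),   # fell into the trap
    PairResult(control_correct=True, trap_correct=True),    # resisted the trap
    PairResult(control_correct=False, trap_correct=False),  # excluded: control wrong
    PairResult(control_correct=True, trap_correct=False),   # fell into the trap
]
rate = bias_trap_rate(results)
```

Conditioning on a correct control separates the shortcut failure from plain lack of knowledge: a model that cannot diagnose the typical presentation is not counted as trapped.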