Medical Multimodal Large Language Models (Med-MLLMs) require egocentric clinical intent understanding for real-world deployment, yet existing benchmarks fail to evaluate this critical capability. To address this gap, we introduce MedGaze-Bench, the first benchmark that leverages clinician gaze as a Cognitive Cursor to assess intent understanding across surgery, emergency simulation, and diagnostic interpretation. Our benchmark addresses three fundamental challenges: the visual homogeneity of anatomical structures, strict temporal-causal dependencies in clinical workflows, and implicit adherence to safety protocols. We propose a Three-Dimensional Clinical Intent Framework that evaluates: (1) Spatial Intent, discriminating precise targets amid visual noise; (2) Temporal Intent, inferring causal rationale through retrospective and prospective reasoning; and (3) Standard Intent, verifying protocol compliance through safety checks. Beyond accuracy metrics, we introduce Trap QA mechanisms that stress-test clinical reliability by penalizing hallucinations and cognitive sycophancy. Experiments reveal that current MLLMs struggle with egocentric intent understanding due to over-reliance on global features, leading to fabricated observations and uncritical acceptance of invalid instructions.