MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence

Hanqi Jiang, Junhao Chen, Mingyu Kang, Hyeokjae Kwon, Yi Pan, Lifeng Chen, Weihang You, Haozhen Gong, Ruiyu Yan, Jinglei Lv, Lin Zhao, Hui Ren, Quanzheng Li, Tianming Liu, Xiang Li

Medical vision-language models (VLMs) are usually evaluated on intact image-question pairs, but trustworthy clinical use requires a stronger property: a model must recognise when the evidential basis for an answer has failed. We study this through silent failures under perturbed evidence, where a vision-required medical question is paired with a false premise, wording perturbation, knowledge-only rewrite, or ROI-corrupted image, yet the model returns a fluent non-refusal answer. We introduce MedVIGIL, a 300-case evaluation suite drawn from four public medical VQA sources and supervised end to end by four board-certified radiologists: every gold answer, refusal option, candidate-answer set, paraphrase, false-premise trap, ROI box, and clinical risk tier is clinician-authored. Two attending radiologists annotate every case in parallel, a senior radiologist consolidates the released manifest, and a fourth radiologist, independent of construction, answers every probe to provide the human reference baseline. The release contains 2556 MCQ probes, 240 counterfactual triplets, physician-adjudicated risk-tier and answerability flags, ROI boxes, and a paired open-ended variant. We report seven correctness-conditioned audit metrics that are summarised in the MedVIGIL Composite Score (MCS), and audit 16 vision-capable models plus two text-only baselines. The independent radiologist scores an MCS of 83.3 at a silent-failure rate of 5.8%, leaving 14.1 points of composite headroom above the strongest audited model (Claude Opus 4.7 at 69.2). The benchmark and evaluation harness are publicly released.
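To make the silent-failure notion concrete, the sketch below counts a silent failure whenever a probe whose evidence is broken receives a non-refusal answer. It is illustrative only: the abstract does not give the exact metric definitions, so the `ProbeResult` record, its `should_refuse` and `refused` fields, and the `silent_failure_rate` function are hypothetical names, and the choice of denominator (all probes that call for refusal) is an assumption. The last lines simply reproduce the headroom arithmetic from the reported scores.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ProbeResult:
    """Hypothetical record for one probe; not the released harness schema."""
    should_refuse: bool  # assumption: evidence is broken, so refusal is the gold behaviour
    refused: bool        # the model selected the refusal option


def silent_failure_rate(results: Iterable[ProbeResult]) -> float:
    """Fraction of refusal-warranted probes answered without refusing.

    Assumed definition for illustration; the paper's metric may differ.
    """
    broken = [r for r in results if r.should_refuse]
    if not broken:
        return 0.0
    silent = sum(1 for r in broken if not r.refused)
    return silent / len(broken)


# Toy data (made up): three broken-evidence probes, one answered fluently.
toy = [
    ProbeResult(should_refuse=True, refused=True),
    ProbeResult(should_refuse=True, refused=False),
    ProbeResult(should_refuse=True, refused=True),
    ProbeResult(should_refuse=False, refused=False),  # intact probe, ignored
]
toy_rate = silent_failure_rate(toy)

# Headroom reported in the abstract: radiologist MCS minus best model MCS.
headroom = round(83.3 - 69.2, 1)
```

Under this assumed definition, refusals on intact probes are not penalised here; a full audit would track them with a separate metric.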

