Medical vision--language models (VLMs) are usually evaluated on intact image--question pairs, but trustworthy clinical use requires a stronger property: a model must recognise when the evidential basis for an answer has failed. We study this through silent failures under perturbed evidence, where a vision-required medical question is paired with a false premise, wording perturbation, knowledge-only rewrite, or ROI-corrupted image, yet the model returns a fluent non-refusal answer. We introduce MedVIGIL, a 300-case evaluation suite drawn from four public medical VQA sources and supervised end to end by four board-certified radiologists: every gold answer, refusal option, candidate-answer set, paraphrase, false-premise trap, ROI box, and clinical risk tier is clinician-authored. Two attending radiologists annotate every case in parallel, a senior radiologist consolidates the released manifest, and a fourth radiologist, independent of construction, answers every probe to provide the human reference baseline. The release contains 2556 MCQ probes, 240 counterfactual triplets, physician-adjudicated risk-tier and answerability flags, ROI boxes, and a paired open-ended variant. We report seven correctness-conditioned audit metrics that are summarised in the MedVIGIL Composite Score (MCS), and audit 16 vision-capable models plus two text-only baselines. The independent radiologist scores MCS 83.3 at a silent-failure rate of 5.8%, leaving 14.1 points of composite headroom above the strongest audited model (Claude Opus 4.7 at 69.2). The benchmark and evaluation harness are publicly released.