Vision Language Models (VLMs) are increasingly used for tasks like medical report generation and visual question answering. However, fluent diagnostic text does not guarantee safe visual understanding. In clinical practice, interpretation begins with pre-diagnostic sanity checks: verifying that the input is valid to read (correct modality and anatomy, plausible viewpoint and orientation, and no obvious integrity violations). Existing benchmarks largely assume this step is solved, and therefore miss a critical failure mode: a model can produce plausible narratives even when the input is inconsistent or invalid. We introduce MedObvious, a 1,880-task benchmark that isolates input validation as a set-level consistency capability over small multi-panel image sets: the model must identify whether any panel violates expected coherence. MedObvious spans five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues, and includes five evaluation formats to test robustness across interfaces. Evaluating 17 different VLMs, we find that sanity checking remains unreliable: several models hallucinate anomalies on normal (negative-control) inputs, performance degrades when scaling to larger image sets, and measured accuracy varies substantially between multiple-choice and open-ended settings. These results show that pre-diagnostic verification remains unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment.
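To make the set-level task concrete, the sketch below shows one way such a sanity check could be scored. It is a minimal illustration under stated assumptions, not the benchmark's actual scoring code: the `Task` record, the `score` function, and the convention that a prediction of `None` means "all panels are consistent" are hypothetical names introduced here. The sketch separates overall accuracy from the false-alarm rate on negative controls, since hallucinated anomalies on normal inputs are one of the failure modes described above.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Task:
    """Hypothetical task record (not the benchmark's real schema)."""
    # Index of the inconsistent panel, or None for a negative control
    # in which every panel is coherent.
    odd_panel: Optional[int]


def score(tasks: Sequence[Task],
          predictions: Sequence[Optional[int]]) -> Dict[str, float]:
    """Score predictions; None means the model reported no violation."""
    if len(tasks) != len(predictions):
        raise ValueError("tasks and predictions must have the same length")
    if not tasks:
        raise ValueError("at least one task is required")

    # A prediction is correct only if it matches the ground truth exactly:
    # the right panel on positive tasks, or None on negative controls.
    correct = sum(p == t.odd_panel for t, p in zip(tasks, predictions))

    # Negative controls: any flagged panel counts as a hallucinated anomaly.
    negatives = [p for t, p in zip(tasks, predictions) if t.odd_panel is None]
    false_alarms = sum(p is not None for p in negatives)

    return {
        "accuracy": correct / len(tasks),
        # Reported as 0.0 when the set contains no negative controls.
        "false_alarm_rate": false_alarms / len(negatives) if negatives else 0.0,
    }


if __name__ == "__main__":
    tasks = [Task(None), Task(2), Task(None), Task(0)]
    predictions = [None, 2, 1, 3]
    print(score(tasks, predictions))
```

Reporting the false-alarm rate separately matters because a model that always flags a panel can still score well on positive tasks while failing every negative control.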