Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA

Self-verification, re-invoking the same vision-language model (VLM) in a fresh context to check its own generated answer, is increasingly used as a default safety layer for medical visual question answering (VQA). We argue that this practice is fundamentally unreliable. We introduce [METHOD NAME], a diagnostic framework that maps the reliability boundary of medical VLM self-verification by decomposing verifier behavior into discrimination capability and agreement bias. Because the verifier and the answer generator are capacity-coupled, the verifier can over-agree with the generator, creating a verification mirage: a regime with both high verifier error and high agreement bias, driven by false acceptance of incorrect answers. Evaluating six open-weight VLMs on five medical VQA datasets spanning seven medical tasks, we find that this boundary is strongly task-conditioned. Knowledge-intensive clinical tasks fall deepest into the mirage, simpler tasks are more resistant, and perceptual tasks lie in between. Verification also fails to provide an independent safety signal: a logistic mixed-effects analysis shows that verifier error and agreement bias become more likely when the generator is wrong, and saliency analyses show that verifiers under-attend to image evidence relative to generators, a phenomenon we call the lazy verifier. Cross-verification reduces but does not eliminate the mirage. Moreover, when verification is reused in multi-turn actor-verifier loops, most initially wrong answers become locked in by false verification. Because our experiments use clean benchmarks, the observed reliability boundary likely underestimates failures in real clinical deployment.
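To make the decomposition above concrete, the following is a minimal sketch of how verifier behavior might be summarized from paired records of generator correctness and verifier verdicts. The abstract does not give the exact metric definitions, so everything here is an assumption for illustration: the name `summarize_verifier`, the use of true-accept rate minus false-accept rate as a stand-in for discrimination capability, and the use of acceptance rate minus generator accuracy as a stand-in for agreement bias.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class VerifierReport:
    verifier_error: float    # fraction of verdicts that contradict ground truth
    false_acceptance: float  # P(verifier accepts | generator answer is wrong)
    false_rejection: float   # P(verifier rejects | generator answer is right)
    discrimination: float    # ASSUMED proxy: true-accept rate minus false-accept rate
    agreement_bias: float    # ASSUMED proxy: acceptance rate minus generator accuracy


def summarize_verifier(
    generator_correct: Sequence[bool],
    verifier_accepts: Sequence[bool],
) -> VerifierReport:
    """Summarize verifier verdicts against ground-truth generator correctness."""
    if len(generator_correct) != len(verifier_accepts) or not generator_correct:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(generator_correct)
    n_right = sum(1 for c in generator_correct if c)
    n_wrong = n - n_right
    pairs = list(zip(generator_correct, verifier_accepts))
    accept_right = sum(1 for c, a in pairs if c and a)
    accept_wrong = sum(1 for c, a in pairs if not c and a)

    # Conditional rates; fall back to 0.0 when a class is empty.
    true_accept = accept_right / n_right if n_right else 0.0
    false_accept = accept_wrong / n_wrong if n_wrong else 0.0
    false_reject = (n_right - accept_right) / n_right if n_right else 0.0

    # A verdict is wrong if it accepts a wrong answer or rejects a right one.
    errors = accept_wrong + (n_right - accept_right)

    return VerifierReport(
        verifier_error=errors / n,
        false_acceptance=false_accept,
        false_rejection=false_reject,
        discrimination=true_accept - false_accept,
        agreement_bias=(accept_right + accept_wrong) / n - n_right / n,
    )


if __name__ == "__main__":
    # Toy data: the verifier accepts one of the two wrong answers.
    report = summarize_verifier(
        generator_correct=[True, True, False, False],
        verifier_accepts=[True, True, True, False],
    )
    print(report)
```

Under these assumed definitions, a mirage-like regime would show up as a high `verifier_error` together with a positive `agreement_bias`, with most of the error coming from `false_acceptance` rather than `false_rejection`.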
