V-Loop: Visual Logical Loop Verification for Hallucination Detection in Medical Visual Question Answering

Multimodal Large Language Models (MLLMs) have shown remarkable capability in assisting disease diagnosis in medical visual question answering (VQA). However, their outputs remain vulnerable to hallucinations (i.e., responses that contradict visual facts), which pose significant risks in high-stakes medical scenarios. Recent introspective detection methods, particularly uncertainty-based approaches, are computationally efficient but fundamentally indirect: they estimate predictive uncertainty for an image-question pair rather than verifying the factual correctness of a specific answer. To address this limitation, we propose Visual Logical Loop Verification (V-Loop), a training-free, plug-and-play framework for hallucination detection in medical VQA. V-Loop introduces a bidirectional reasoning process that forms a visually grounded logical loop to verify factual correctness. Given an image-question pair, the MLLM first produces a primary answer. V-Loop then extracts semantic units from the primary QA pair, generates a verification question by conditioning on the answer unit to re-query the question unit, and enforces visual attention consistency so that the answers to both the primary question and the verification question rely on the same image evidence. If the verification answer matches the expected semantic content, the logical loop closes, indicating factual grounding; otherwise, the primary answer is flagged as hallucinated. Extensive experiments on multiple medical VQA benchmarks and MLLMs show that V-Loop consistently outperforms existing introspective methods, remains highly efficient, and further boosts uncertainty-based approaches when used in combination.
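
The verification loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Answer`, `ask`, `extract_units`, `make_verification_question`, and `matches` are hypothetical placeholders for the MLLM call and the semantic-unit steps, and the abstract does not specify how attention consistency is enforced. As a stand-in, the sketch assumes a simple cosine-similarity check between the two attention maps with an arbitrary threshold.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Answer:
    """Hypothetical container for an MLLM response."""
    text: str
    attention: Sequence[float]  # assumed: one weight per image patch


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_hallucinated(
    ask: Callable[[object, str], Answer],
    image: object,
    question: str,
    extract_units: Callable[[str, str], Tuple[str, str]],
    make_verification_question: Callable[[str, str], str],
    matches: Callable[[str, str], bool],
    attn_threshold: float = 0.5,  # assumed value, not taken from the paper
) -> bool:
    """Return True when the logical loop fails to close."""
    # Forward pass: answer the primary question.
    primary = ask(image, question)

    # Extract the semantic units of the primary QA pair.
    question_unit, answer_unit = extract_units(question, primary.text)

    # Backward pass: condition on the answer unit to re-query the question unit.
    verification_question = make_verification_question(question_unit, answer_unit)
    verification = ask(image, verification_question)

    # Assumption: attention consistency approximated by a similarity check.
    same_evidence = cosine(primary.attention, verification.attention) >= attn_threshold

    # The loop closes only if the verification answer recovers the question unit
    # and both answers rely on the same image evidence.
    loop_closes = same_evidence and matches(verification.text, question_unit)
    return not loop_closes
```

In this sketch, all model-specific behavior is passed in as callables, so the same control flow can wrap any MLLM without retraining, which mirrors the training-free, plug-and-play framing of the abstract.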
