UniVRSE: Unified Vision-conditioned Response Semantic Entropy for Hallucination Detection in Medical Vision-Language Models

Vision-language models (VLMs) have great potential for medical image understanding, particularly in Visual Report Generation (VRG) and Visual Question Answering (VQA), but they may generate hallucinated responses that contradict visual evidence, which limits clinical deployment. Uncertainty-based hallucination detection methods are intuitive and effective, but they transfer poorly to medical VLMs. Specifically, Semantic Entropy (SE), which is effective in text-only LLMs, becomes less reliable in medical VLMs because strong language priors make these models overconfident. To address this challenge, we propose UniVRSE, a Unified Vision-conditioned Response Semantic Entropy framework for hallucination detection in medical VLMs. UniVRSE strengthens visual guidance during uncertainty estimation by contrasting the semantic predictive distributions derived from an original image-text pair and a visually distorted counterpart, with higher entropy indicating higher hallucination risk. For VQA, UniVRSE operates on the image-question pair, while for VRG, it decomposes the report into claims, generates verification questions, and applies vision-conditioned entropy estimation at the claim level. To evaluate hallucination detection, we propose a unified pipeline that generates responses on medical datasets and derives hallucination labels via factual consistency assessment. Because existing factual consistency assessment methods rely on subjective criteria or modality-specific rules, we introduce Alignment Ratio of Atomic Facts (ALFA), a novel method that quantifies fine-grained factual consistency. ALFA-derived labels serve as ground truth for robust benchmarking. Experiments on six medical VQA/VRG datasets and three VLMs show that UniVRSE significantly outperforms existing methods, with strong cross-modal generalization.
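As a rough illustration of the vision-conditioned entropy idea, the sketch below computes a discrete semantic entropy over clusters of sampled responses and contrasts the distribution obtained with the original image against the one obtained with a distorted image. The abstract does not specify the contrast rule, so the log-linear form used here, the `alpha` weight, the `eps` smoothing term, and all function names are assumptions made for illustration, not the UniVRSE formulation. Semantic clustering of responses is assumed to have been done upstream.

```python
import math
from collections import Counter


def semantic_distribution(cluster_ids):
    """Empirical distribution over semantic clusters of sampled responses."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return {c: n / total for c, n in counts.items()}


def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def contrast(p_orig, p_dist, alpha=1.0, eps=1e-6):
    """Hypothetical contrast rule (an assumption, not taken from the paper):
    score each cluster by (1 + alpha) * log p_orig - alpha * log p_dist,
    then renormalize with a softmax."""
    clusters = set(p_orig) | set(p_dist)
    logits = {
        c: (1 + alpha) * math.log(p_orig.get(c, 0.0) + eps)
        - alpha * math.log(p_dist.get(c, 0.0) + eps)
        for c in clusters
    }
    m = max(logits.values())  # subtract the max for numerical stability
    weights = {c: math.exp(v - m) for c, v in logits.items()}
    z = sum(weights.values())
    return {c: w / z for c, w in weights.items()}


# Cluster ids of responses sampled with the original and the distorted image.
p_orig = semantic_distribution(["a"] * 8 + ["b"] * 2)
p_dist = semantic_distribution(["a"] * 9 + ["b"] * 1)

plain_se = entropy(p_orig)
vision_conditioned_se = entropy(contrast(p_orig, p_dist))
```

In this toy case the dominant answer remains just as likely when the image is distorted, which suggests it is driven by the language prior rather than by visual evidence. Under the assumed rule, the contrast discounts that answer, so the vision-conditioned entropy comes out higher than the plain semantic entropy.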
