Faithfulness hallucinations in VQA occur when vision-language models produce fluent yet visually ungrounded answers, severely undermining their reliability in safety-critical applications. Existing detection methods mainly fall into two categories: external verification approaches that rely on auxiliary models or knowledge bases, and uncertainty-driven approaches that use repeated sampling or uncertainty estimates. The former suffer from high computational overhead and are limited by the quality of external resources, while the latter capture only limited facets of model uncertainty and fail to exploit the rich internal signals associated with the model's diverse failure modes. Both paradigms thus have inherent limitations in efficiency, robustness, and detection performance. To address these challenges, we propose FaithSCAN, a lightweight network that detects hallucinations by exploiting rich internal signals of VLMs, including token-level decoding uncertainty, intermediate visual representations, and cross-modal alignment features. These signals are fused via branch-wise evidence encoding and uncertainty-aware attention. We also extend the LLM-as-a-Judge paradigm to VQA hallucination detection and propose a low-cost strategy for automatically generating model-dependent supervision signals, enabling supervised training without costly human labels while maintaining high detection accuracy. Experiments on multiple VQA benchmarks show that FaithSCAN significantly outperforms existing methods in both effectiveness and efficiency. In-depth analysis reveals that hallucinations arise from systematic variations in internal states spanning visual perception, cross-modal reasoning, and language decoding. Different internal signals provide complementary diagnostic cues, and hallucination patterns vary across VLM architectures, offering new insights into the underlying causes of multimodal hallucinations.
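To make the fusion idea concrete, the following is a minimal sketch of branch-wise evidence encoding with uncertainty-aware attention: each internal signal (decoding-uncertainty statistics, pooled visual representations, cross-modal alignment features) passes through its own small encoder, and a learned query scores the branches with a bias derived from an overall uncertainty summary before a binary hallucination head. All dimensions, layer choices, the scalar uncertainty gate, and the name FaithSCANSketch are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class FaithSCANSketch(nn.Module):
    """Illustrative sketch: branch-wise evidence encoders fused by
    uncertainty-aware attention, followed by a hallucination classifier.
    Dimensions and layer choices are assumptions for illustration."""

    def __init__(self, d_unc=8, d_vis=1024, d_align=512, d_model=256):
        super().__init__()
        # One lightweight encoder per internal-signal branch.
        self.enc_unc = nn.Sequential(nn.Linear(d_unc, d_model), nn.GELU())
        self.enc_vis = nn.Sequential(nn.Linear(d_vis, d_model), nn.GELU())
        self.enc_align = nn.Sequential(nn.Linear(d_align, d_model), nn.GELU())
        # Learned query scores each branch; a scalar uncertainty summary
        # adds a per-branch bias, making the attention uncertainty-aware.
        self.query = nn.Parameter(torch.randn(d_model))
        self.unc_gate = nn.Linear(1, 3)
        self.head = nn.Sequential(nn.Linear(d_model, 64), nn.GELU(),
                                  nn.Linear(64, 1))  # hallucination logit

    def forward(self, unc_feats, vis_feats, align_feats, unc_scalar):
        # unc_feats:   (B, d_unc)   token-level decoding-uncertainty statistics
        # vis_feats:   (B, d_vis)   pooled intermediate visual representation
        # align_feats: (B, d_align) cross-modal alignment features
        # unc_scalar:  (B, 1)       overall decoding-uncertainty summary
        branches = torch.stack([self.enc_unc(unc_feats),
                                self.enc_vis(vis_feats),
                                self.enc_align(align_feats)], dim=1)   # (B, 3, d_model)
        scores = branches @ self.query / branches.shape[-1] ** 0.5     # (B, 3)
        scores = scores + self.unc_gate(unc_scalar)                    # uncertainty-aware bias
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)          # (B, 3, 1)
        fused = (weights * branches).sum(dim=1)                        # (B, d_model)
        return self.head(fused).squeeze(-1)                            # hallucination logit

if __name__ == "__main__":
    model = FaithSCANSketch()
    logits = model(torch.randn(4, 8), torch.randn(4, 1024),
                   torch.randn(4, 512), torch.rand(4, 1))
    print(logits.shape)  # torch.Size([4])
```

In practice such a detector would be trained with a binary cross-entropy loss against the automatically generated (LLM-as-a-Judge) labels described above; the sketch only illustrates how heterogeneous internal signals can be encoded per branch and weighted by decoding uncertainty before classification.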