Large vision-language models (LVLMs) achieve strong performance on visual reasoning tasks but remain highly susceptible to hallucination. Existing detection methods predominantly rely on coarse, whole-image measures of how an object token relates to the input image. This global strategy is limited: a hallucinated token may exhibit weak but widely scattered correlations across many local regions, which aggregate into a deceptively high overall relevance score and thereby evade global detectors. We begin with a simple yet critical observation: a faithful object token must be strongly grounded in a specific image region. Building on this insight, we introduce a patch-level hallucination detection framework that examines fine-grained token-level interactions across model layers. Our analysis uncovers two characteristic signatures of hallucinated tokens: (i) they yield diffuse, non-localized attention patterns, in contrast to the compact, well-focused attention seen in faithful tokens; and (ii) they fail to exhibit meaningful semantic alignment with any visual region. Guided by these findings, we develop a lightweight and interpretable detection method that combines patch-level statistical features with hidden-layer representations. Our approach achieves up to 90% accuracy in token-level hallucination detection, demonstrating the advantage of fine-grained structural analysis over whole-image measures for detecting hallucinations.
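To make the contrast between whole-image and patch-level measures concrete, the following is a minimal sketch rather than the method itself. It assumes a token's relevance to the image is available as one non-negative score per patch; the function name `patch_stats`, the 24x24 patch grid, and the three chosen statistics (normalized entropy, top-k mass, peak value) are illustrative assumptions. The sketch shows that a diffuse pattern and a focused pattern can share the same global score while differing sharply in their patch-level statistics.

```python
import numpy as np

def patch_stats(relevance, top_k=5):
    """Summarize how a token's relevance is spread over image patches.

    `relevance` is a hypothetical 1-D array of non-negative per-patch scores
    (for example, attention weights from a single layer). The statistics
    below are illustrative choices, not the features used in the paper.
    """
    r = np.asarray(relevance, dtype=float)
    total = r.sum()  # the coarse, whole-image score
    if total <= 0:
        raise ValueError("relevance must contain at least one positive score")
    p = r / total  # normalize to a distribution over patches
    nz = p[p > 0]  # skip zeros so that log() is defined
    # Normalized entropy: near 0 when mass sits on one patch, 1 when uniform.
    entropy = -(nz * np.log(nz)).sum() / np.log(len(p))
    # Fraction of the total mass carried by the top_k strongest patches.
    top_k_mass = np.sort(p)[-top_k:].sum()
    return {
        "global": total,
        "entropy": entropy,
        "top_k_mass": top_k_mass,
        "peak": r.max(),
    }

n_patches = 576  # assumed 24x24 patch grid

# Diffuse pattern: weak relevance spread evenly over every patch.
diffuse = np.full(n_patches, 1.0 / n_patches)

# Focused pattern: all relevance concentrated on four patches.
focused = np.zeros(n_patches)
focused[:4] = 0.25

d = patch_stats(diffuse)
f = patch_stats(focused)

for name, s in (("diffuse", d), ("focused", f)):
    print(name, {k: round(float(v), 4) for k, v in s.items()})
```

Both patterns sum to the same global score, so a detector that thresholds only that sum cannot separate them. The entropy, top-k mass, and peak value do separate them, which is the kind of patch-level signal the framework relies on.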