Vision-Language Models (VLMs) often hallucinate objects that are not present in the input image. We identify a contributing cause of this behavior, which we term spatial credit collapse: in early transformer layers, hidden-state activation concentrates on a small number of visual patches, suppressing surrounding contextual evidence and increasing reliance on language priors. Across seven models we observe a strong correlation between visual attention entropy and hallucination rate (r = -0.65, p < 0.001), suggesting that reduced spatial credit diversity contributes to hallucination. To address this issue we propose Spatial Credit Redistribution (SCR), a training-free inference-time method. SCR uses a lightweight two-pass procedure: a diagnostic pass identifies the top-K high-attention source patches and their spatial neighbors, and a redistribution pass then scales each source by 1/lambda (~0.91) and injects a (lambda - 1)-weighted copy of its hidden state into the neighboring patches, restoring suppressed visual context without modifying model weights. Because the diagnostic pass runs once per image and is reused across the output sequence, the added latency is negligible (<0.5 ms per token for 100-token responses). We evaluate SCR on seven model configurations from four VLM families (Chameleon, LLaVA-1.5, Qwen-VL/Qwen2-VL, and InternVL2) across five benchmarks: POPE, CHAIR, MME, HallusionBench, and AMBER. SCR reduces POPE-Adversarial hallucination by 4.6-6.0 percentage points and CHAIR-s by 41-51 percent while preserving caption quality (CIDEr drop <= 0.8). Compared with prior inference-time methods including OPERA, VCD, OA-VCD, DoLa, VLI, SID, and CRoPS, SCR achieves a better trade-off among hallucination reduction, generation quality, and latency.
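The two-pass procedure described above can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation: the patch grid size, the 4-connected neighbor definition, K, and lambda (chosen so that 1/lambda is approximately 0.91) are all assumptions, and a real implementation would operate on early-layer hidden states inside the model's forward pass.

```python
import numpy as np

def spatial_credit_redistribution(hidden, attn, k=8, lam=1.1, grid=(24, 24)):
    """Sketch of SCR on one image.

    hidden: (P, D) array of per-patch hidden states (P = H * W patches).
    attn:   (P,) array of per-patch attention mass from the diagnostic pass.
    k, lam, grid: illustrative hyperparameters (assumed, not from the paper).
    """
    P, D = hidden.shape
    H, W = grid
    assert P == H * W, "patch count must match the grid"
    out = hidden.astype(np.float64)

    # Diagnostic pass: indices of the top-K high-attention source patches.
    sources = np.argsort(attn)[-k:]

    for s in sources:
        r, c = divmod(int(s), W)
        # 4-connected spatial neighbors of the source patch (assumed definition).
        neighbors = [(r + dr, c + dc)
                     for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                     if 0 <= r + dr < H and 0 <= c + dc < W]
        # Redistribution pass: damp the source by 1/lambda ...
        out[s] = hidden[s] / lam
        # ... and inject a (lambda - 1)-weighted copy of its original
        # hidden state into each neighboring patch.
        for nr, nc in neighbors:
            out[nr * W + nc] += (lam - 1.0) * hidden[s]
    return out
```

Because the injection uses the original (pre-damping) hidden states, the result does not depend on the order in which sources are processed; the diagnostic pass output (`sources` and their neighbors) would be computed once per image and reused for every generated token.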