Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

Multimodal foundation models are increasingly deployed as reasoning agents, which makes reliability (knowing when a model is likely to hallucinate) critical. A common intuition, which we call the Attention-Confidence Assumption, holds that reliability follows from "structural" visual perception: tight attention on relevant regions should signal a trustworthy answer, while scattered attention should signal confusion. We challenge this assumption with the VLM Reliability Probe (VRP), a systematic cross-family study of reliability signals in contemporary Vision-Language Models (VLMs). We introduce two structural-attention metrics, cluster count (C_k) and spatial entropy (H_s), to quantify the visual encoder's gaze, and we track the change in spatial entropy (ΔH_s) across layers. This analysis reveals a "Symbolic Detachment": models often "Early Lock" onto visual features and then diffuse their attention in later layers, severing early perception from final generation. Contrary to the grounding hypothesis, we find a "Cluster Failure": spatial attention has near-zero correlation with accuracy (R ≈ 0.001). Reliability is instead a property of generation dynamics and internal-state distributions. Self-Consistency, the agreement rate across sampled reasoning paths, is the dominant predictor of correctness (R = 0.429). Scaling up causal interventions exposes a sharp architectural divergence: LLaVA locks its prediction into a fragile late-stage bottleneck, whereas PaliGemma and Qwen2-VL distribute reliability globally and stay resilient even when roughly 50% or more of their most predictive layer is destroyed. For current VLMs, reliability signals are detached from visual grounding maps and are best inferred from generation-time dynamics and hidden-state probes.
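
As a rough illustration of the quantities named above, the following minimal Python sketch computes a spatial entropy, its change between two layers, and a self-consistency score. The abstract does not give exact definitions, so everything here is an assumption: H_s is taken to be the Shannon entropy of an attention map normalized to sum to one, ΔH_s the late-layer entropy minus the early-layer entropy, and self-consistency the share of sampled answers that match the majority answer. The function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def spatial_entropy(attn):
    """Shannon entropy (in nats) of a 2D attention map.

    Assumption: the map is normalized to sum to 1 before the entropy is
    computed; the paper's exact definition of H_s may differ.
    """
    flat = [w for row in attn for w in row]
    total = sum(flat)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    probs = [w / total for w in flat if w > 0]
    return -sum(p * math.log(p) for p in probs)


def entropy_shift(early_attn, late_attn):
    """Assumed form of Delta H_s: late-layer entropy minus early-layer entropy.

    A positive value means attention became more diffuse in the later layer.
    """
    return spatial_entropy(late_attn) - spatial_entropy(early_attn)


def self_consistency(answers):
    """Return the majority answer and the fraction of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    top, count = Counter(answers).most_common(1)[0]
    return top, count / len(answers)


# Hypothetical toy inputs, for illustration only.
focused = [[1.0, 0.0], [0.0, 0.0]]       # all attention on one patch
diffuse = [[0.25, 0.25], [0.25, 0.25]]   # attention spread uniformly
samples = ["cat", "cat", "dog", "cat"]   # answers from four sampled paths

shift = entropy_shift(focused, diffuse)
majority, agreement = self_consistency(samples)
```

In this toy case the focused map has zero entropy and the uniform map has entropy log 4, so the shift is positive, which corresponds to the "lock early, diffuse later" pattern described in the abstract. The agreement score is the kind of generation-time signal the abstract reports as the stronger predictor.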
