Large Vision-Language Models (LVLMs) have achieved remarkable success across a wide range of multimodal tasks, yet their robustness to spatial variations remains insufficiently understood. In this work, we conduct a systematic study of spatial bias in LVLMs, examining how models respond when identical key visual information is placed at different locations within an image. Through controlled probing experiments, we observe that current LVLMs often produce inconsistent outputs under such spatial shifts, revealing a clear spatial bias in their semantic understanding. Further analysis indicates that this bias does not stem from the vision encoder itself, but rather from a mismatch between the attention mechanisms of the vision encoder and the large language model, which disrupts the global flow of visual information. Motivated by this insight, we propose Adaptive Global Context Injection (AGCI), a lightweight mechanism that dynamically injects a shared global visual context into each image token. AGCI requires no architectural modifications; it mitigates spatial bias by improving the semantic accessibility of image tokens while preserving the model's intrinsic capabilities. Extensive experiments demonstrate that AGCI not only strengthens the spatial robustness of LVLMs but also achieves strong performance on a variety of downstream tasks and hallucination benchmarks.
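The abstract does not specify how the shared global context is computed or injected. Below is a minimal PyTorch sketch of one plausible reading: the global context is taken as a mean pooling over image tokens (an assumption), and "adaptive" injection is realized as a per-token sigmoid-gated residual (also an assumption). The class name `AdaptiveGlobalContextInjection` and all parameter choices are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class AdaptiveGlobalContextInjection(nn.Module):
    """Sketch of AGCI under stated assumptions: mean-pooled global
    context, injected via a learned per-token gate. Output keeps the
    same shape as the input, so no surrounding architecture changes."""

    def __init__(self, dim: int):
        super().__init__()
        # Gate decides, per token, how much global context to inject
        # based on the concatenated local and global features.
        self.gate = nn.Linear(dim * 2, dim)

    def forward(self, image_tokens: torch.Tensor) -> torch.Tensor:
        # image_tokens: (batch, num_tokens, dim)
        # Shared global visual context: mean over all image tokens (assumption).
        global_ctx = image_tokens.mean(dim=1, keepdim=True)   # (B, 1, D)
        global_ctx = global_ctx.expand_as(image_tokens)       # (B, N, D)
        # Adaptive per-token gating in [0, 1].
        gate = torch.sigmoid(
            self.gate(torch.cat([image_tokens, global_ctx], dim=-1))
        )
        # Gated residual injection of the shared context into each token.
        return image_tokens + gate * global_ctx

# Usage sketch: dimensions mimic a ViT with 24x24 patch tokens (hypothetical).
agci = AdaptiveGlobalContextInjection(dim=1024)
tokens = torch.randn(2, 576, 1024)
out = agci(tokens)   # same shape: (2, 576, 1024)
```

A gated residual is one natural way to add global information "without architectural modifications", since it is shape-preserving and can be dropped between the vision encoder and the language model; the actual AGCI mechanism may differ.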