Vision-Language Models (VLMs) excel at identifying and describing objects but struggle with spatial reasoning, such as accurately understanding the relative positions of objects. Inspired by the dual-pathway (ventral-dorsal) model of human vision, we investigate why VLMs fail at spatial tasks despite strong object recognition capabilities. Our interpretability-driven analysis reveals a critical underlying cause: vision embeddings in VLMs are treated primarily as a semantic ``bag-of-tokens,'' overshadowing subtle yet crucial positional cues due to their disproportionately large embedding norms. We validate this insight through extensive diagnostic experiments, demonstrating minimal performance impact when token order or fine-grained spatial details are removed. Guided by these findings, we propose simple, interpretable interventions, including normalizing vision embedding norms and extracting mid-layer spatially rich features, to restore spatial awareness. Empirical results on both our synthetic data and standard benchmarks demonstrate improved spatial reasoning capabilities, highlighting the value of interpretability-informed design choices. Our study not only uncovers fundamental limitations in current VLM architectures but also provides actionable insights for enhancing structured perception of visual scenes.
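The norm-normalization intervention mentioned above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name, array shapes, and the choice of matching vision-token norms to the mean text-token norm are assumptions made for demonstration.

```python
import numpy as np

def normalize_vision_embeddings(vision_tokens, text_tokens):
    """Rescale each vision-token embedding so its norm matches the mean
    text-token norm, so that disproportionately large vision norms do not
    drown out positional cues. Shapes (illustrative): (n_vis, d), (n_txt, d).
    """
    target = np.linalg.norm(text_tokens, axis=-1).mean()
    norms = np.linalg.norm(vision_tokens, axis=-1, keepdims=True)
    return vision_tokens / norms * target

# Toy demonstration with deliberately oversized vision-token norms.
rng = np.random.default_rng(0)
vis = rng.normal(size=(4, 8)) * 10.0  # large-norm "vision" embeddings
txt = rng.normal(size=(6, 8))         # typical-norm "text" embeddings
out = normalize_vision_embeddings(vis, txt)
```

After rescaling, every vision token has the same norm as the average text token, while its direction (and thus its semantic content) is unchanged.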