Vision-Language Models (VLMs) excel at identifying and describing objects but struggle with spatial reasoning, such as accurately understanding the relative positions of objects. Inspired by the dual-pathway (ventral-dorsal) model of human vision, we investigate why VLMs fail at spatial tasks despite strong object recognition capabilities. Our interpretability-driven analysis reveals a critical underlying cause: vision embeddings in VLMs are treated primarily as a semantic ``bag-of-tokens,'' overshadowing subtle yet crucial positional cues due to their disproportionately large embedding norms. We validate this insight through extensive diagnostic experiments, demonstrating minimal performance impact when token order or fine-grained spatial details are removed. Guided by these findings, we propose simple, interpretable interventions, including normalizing vision embedding norms and extracting mid-layer spatially rich features, to restore spatial awareness. Empirical results on both our synthetic data and standard benchmarks demonstrate improved spatial reasoning capabilities, highlighting the value of interpretability-informed design choices. Our study not only uncovers fundamental limitations in current VLM architectures but also provides actionable insights for enhancing structured perception of visual scenes.
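The norm-normalization intervention mentioned above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name, array shapes, and the choice of matching vision-token norms to the mean text-token norm are assumptions made for demonstration.

```python
import numpy as np

def normalize_vision_embeddings(vision_tokens, text_tokens):
    """Rescale each vision-token embedding so its norm matches the mean
    text-token norm, so that disproportionately large vision norms do not
    drown out positional cues. Shapes (illustrative): (n_vis, d), (n_txt, d).
    """
    target = np.linalg.norm(text_tokens, axis=-1).mean()
    norms = np.linalg.norm(vision_tokens, axis=-1, keepdims=True)
    return vision_tokens / norms * target

# Toy demonstration with deliberately oversized vision-token norms.
rng = np.random.default_rng(0)
vis = rng.normal(size=(4, 8)) * 10.0  # large-norm "vision" embeddings
txt = rng.normal(size=(6, 8))         # typical-norm "text" embeddings
out = normalize_vision_embeddings(vis, txt)
```

After rescaling, every vision token has the same norm as the average text token, while its direction (and thus its semantic content) is unchanged.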