Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties and spatial relations. Yet it remains unclear where and how such associations are computed within VLMs. In this work, we show that VLMs rely on two concurrent mechanisms to represent such associations. In the language model backbone, intermediate layers represent content-independent spatial relations on top of visual tokens corresponding to objects. However, this mechanism plays only a secondary role in shaping model predictions. Instead, the dominant source of spatial information is the vision encoder, whose representations encode the layout of objects and are directly exploited by the language model backbone. Notably, this spatial signal is distributed globally across visual tokens, extending beyond object regions into surrounding background areas. We show that enhancing these vision-derived spatial representations across all image tokens improves spatial reasoning performance on naturalistic images. Together, our results clarify how spatial association is computed within VLMs and highlight the central role of vision encoders in enabling spatial reasoning.