Spatial reasoning -- the ability to perceive and reason about relationships in space -- advances vision-language models (VLMs) from visual perception toward spatial semantic understanding. Existing approaches either revisit local image patches, improving fine-grained perception at the cost of global spatial awareness, or mark isolated coordinates, which capture object locations but overlook their overall organization. In this work, we integrate the cognitive concept of an object-centric blueprint into VLMs to enhance spatial reasoning. Given an image and a question, the model first constructs a JSON-style blueprint that records the positions, sizes, and attributes of the relevant objects, and then reasons over this structured representation to produce the final answer. To achieve this, we introduce three key techniques: (1) blueprint-embedded reasoning traces for supervised fine-tuning, which elicit basic blueprint-based reasoning skills; (2) blueprint-aware rewards in reinforcement learning, which encourage the blueprint to include an appropriate number of objects and keep final answers causally grounded in the blueprint; and (3) anti-shortcut data augmentation, which applies targeted perturbations to images and questions to discourage reliance on superficial visual or linguistic cues. Experiments show that our method consistently outperforms existing VLMs and specialized spatial reasoning models.
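To make the representation concrete, the following is a minimal sketch of what such a JSON-style blueprint could look like, shown as a Python dict. The field names ("objects", "name", "bbox", "attributes") and the [x, y, w, h] pixel convention are illustrative assumptions, not the paper's exact schema.

```python
# Hypothetical blueprint for a scene with a mug resting on a table.
# Schema and coordinate convention are assumptions for illustration.
blueprint = {
    "objects": [
        {"name": "mug",   "bbox": [420, 300, 60, 60],   "attributes": ["red"]},
        {"name": "table", "bbox": [120, 360, 540, 220], "attributes": ["wooden"]},
    ]
}

# The model would first emit such a structure, then reason over it,
# e.g. comparing bounding boxes to answer "Is the mug on the table?".
mug, table = blueprint["objects"]
mug_bottom = mug["bbox"][1] + mug["bbox"][3]   # y + h
print(mug_bottom <= table["bbox"][1] + 20)     # crude "resting on" test with a 20px tolerance
```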
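The blueprint-aware reward can likewise be sketched as a weighted combination of an object-count term and an answer-correctness term. The function name, count range, and 0.3/0.7 weights below are hypothetical, chosen only to illustrate the shape of such a reward, not the paper's actual coefficients.

```python
def blueprint_reward(blueprint, answer, gold_answer, n_min=2, n_max=10):
    """Hypothetical blueprint-aware reward: encourages the blueprint to
    contain an appropriate number of objects (n_min..n_max) and the final
    answer to match the reference. All constants are assumptions."""
    n = len(blueprint.get("objects", []))
    count_term = 1.0 if n_min <= n <= n_max else 0.0
    answer_term = 1.0 if answer == gold_answer else 0.0
    return 0.3 * count_term + 0.7 * answer_term
```

In a standard RL fine-tuning loop, a scalar of this form could augment the usual correctness-only reward, so that rollouts are credited both for producing a well-populated blueprint and for answers consistent with it.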