Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

Yatai Ji, An-Chieh Cheng, Yang Fu, Yukang Chen, Han Zhang, Zhaojing Yang, Wei Huang, Ka Chun Cheung, Song Han, Vidya Nariyambut Murali, Pavlo Molchanov, Jan Kautz, Simon See, Hongxu Yin, Ping Luo, Sifei Liu

Spatial VLMs have made substantial progress in geometric perception, yet complex spatial reasoning requiring multi-step inference over depth, distance, and scene relations remains challenging. Moreover, different spatial queries call for fundamentally different strategies: some are best addressed through purely linguistic, step-by-step deduction, while others require explicit 3D grounding before quantitative inference. We present Dual-Path Spatial Reasoning via Reinforcement Learning for Spatial VLMs (SR-REAL), a unified framework that equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR), which performs step-by-step linguistic deduction, and Detect-Then-Reason (DTR), which detects 3D geometric cues (e.g., centers or bounding boxes) via region tokens before explicit geometric inference. SR-REAL begins with a cold-start supervised fine-tuning stage that constructs LOR and DTR chain-of-thought supervision and exposes a region-to-3D interface, followed by RL that optimizes the policy model with accuracy and format rewards; for DTR, a discrete center-based detection reward further refines geometric alignment. Across diverse spatial benchmarks, SR-REAL significantly outperforms spatial VLM baselines: (i) a single RL-trained model supports both reasoning paths, with DTR excelling in region-aware tasks through precise 3D localization and LOR enhancing general spatial reasoning; (ii) jointly training both paths fosters mutual reinforcement; (iii) high-quality, blended cold-start data is crucial for stable RL optimization; and (iv) the model generalizes across datasets and domains without per-task tuning, demonstrating positive transfer between LOR and DTR.
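
To make the reward structure described above concrete, the following is a minimal sketch of how accuracy, format, and a discrete center-based detection reward might be combined for a single rollout. The abstract does not define these rewards, so everything here is an assumption made for illustration: the `<think>`/`<answer>` output template, the exact-match accuracy check, the distance thresholds, the weight `w_det`, and all function names are hypothetical.

```python
import math
import re

# Hypothetical output template (an assumption, not taken from the paper):
#   <think> reasoning </think><answer> final answer </answer>
FORMAT_RE = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def format_reward(response: str) -> float:
    """1.0 if the whole response follows the assumed template, else 0.0."""
    return 1.0 if FORMAT_RE.fullmatch(response.strip()) else 0.0


def accuracy_reward(response: str, target: str) -> float:
    """1.0 if the extracted answer matches the target (case-insensitive)."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == target.strip().lower() else 0.0


def center_detection_reward(pred_centers, gt_centers, thresholds=(0.25, 0.5, 1.0)):
    """Discrete reward from 3D center distances.

    For each object, the score is the fraction of distance thresholds
    (assumed to be in meters) that the predicted center satisfies; the
    result is averaged over objects. The thresholds are illustrative.
    """
    if not gt_centers or len(pred_centers) != len(gt_centers):
        return 0.0
    total = 0.0
    for pred, gt in zip(pred_centers, gt_centers):
        dist = math.dist(pred, gt)
        total += sum(dist <= t for t in thresholds) / len(thresholds)
    return total / len(gt_centers)


def total_reward(response, target, pred_centers=None, gt_centers=None, w_det=1.0):
    """Accuracy + format for both paths; add the detection term for DTR only."""
    reward = accuracy_reward(response, target) + format_reward(response)
    if gt_centers is not None:  # DTR sample with 3D center annotations
        reward += w_det * center_detection_reward(pred_centers or [], gt_centers)
    return reward
```

In this sketch an LOR rollout is scored by calling `total_reward(response, target)` alone, while a DTR rollout additionally passes predicted and ground-truth centers. Bucketing the distance into a few thresholds is one simple way to make a center-based reward discrete; the paper's actual formulation may differ.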
