Spatial vision-language models (VLMs) have made substantial progress in geometric perception, yet complex spatial reasoning that requires multi-step inference over depth, distance, and scene relations remains challenging. Moreover, different spatial queries call for fundamentally different strategies: some are best addressed through purely linguistic, step-by-step deduction, while others require explicit 3D grounding before quantitative inference. We present Dual-Path Spatial Reasoning via Reinforcement Learning for Spatial VLMs (SR-REAL), a unified framework that equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR), which performs step-by-step linguistic deduction, and Detect-Then-Reason (DTR), which first detects 3D geometric cues (e.g., centers or bounding boxes) via region tokens and then performs explicit geometric inference. Training begins with a cold-start supervised fine-tuning stage that constructs LOR and DTR chain-of-thought supervision and exposes a region-to-3D interface, followed by a reinforcement learning (RL) stage that optimizes the policy model with accuracy and format rewards; for DTR, a discrete center-based detection reward further refines geometric alignment. Across diverse spatial benchmarks, SR-REAL significantly outperforms spatial VLM baselines. We further find that (i) a single RL-trained model supports both reasoning paths, with DTR excelling in region-aware tasks through precise 3D localization and LOR enhancing general spatial reasoning; (ii) jointly training both paths fosters mutual reinforcement; (iii) high-quality, blended cold-start data is crucial for stable RL optimization; and (iv) the model generalizes across datasets and domains without per-task tuning, demonstrating positive transfer between LOR and DTR.
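To make the reward design described above concrete, the following is a minimal sketch of how accuracy, format, and discrete center-based detection rewards might be combined for one sampled response. It is an illustration under stated assumptions, not the SR-REAL implementation: the `<think>`/`<answer>` output template, the exact-match accuracy check, the distance threshold `threshold_m`, the reward weights, and all function names are hypothetical, since the abstract does not specify them.

```python
import math
import re

# Hypothetical output template; the abstract does not specify the actual format.
_FORMAT_RE = re.compile(r"^<think>.+?</think>\s*<answer>.+?</answer>$", re.DOTALL)


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the assumed think/answer template."""
    return 1.0 if _FORMAT_RE.match(response.strip()) else 0.0


def accuracy_reward(response: str, target: str) -> float:
    """Return 1.0 if the extracted answer matches the target (assumed exact match)."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip().lower() == target.strip().lower() else 0.0


def center_detection_reward(pred_center, gt_center, threshold_m: float = 0.5) -> float:
    """Discrete reward: 1.0 if the predicted 3D center lies within an assumed
    distance threshold of the ground-truth center, else 0.0."""
    return 1.0 if math.dist(pred_center, gt_center) <= threshold_m else 0.0


def total_reward(response, target, pred_center=None, gt_center=None,
                 weights=(1.0, 0.5, 0.5)) -> float:
    """Combine the rewards; the detection term is added only when centers are
    supplied, i.e. for DTR-style responses. The weights are illustrative."""
    w_acc, w_fmt, w_det = weights
    reward = w_acc * accuracy_reward(response, target) + w_fmt * format_reward(response)
    if pred_center is not None and gt_center is not None:
        reward += w_det * center_detection_reward(pred_center, gt_center)
    return reward
```

In this sketch, an LOR-style response is scored by calling `total_reward(response, target)` alone, while a DTR-style response additionally passes the predicted and ground-truth 3D centers so that the detection term contributes.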