Vision Language Models (VLMs) demonstrate strong qualitative visual understanding but struggle with the metrically precise spatial reasoning required for embodied applications. The agentic paradigm promises that VLMs can augment these capabilities with a wide variety of tools, such as depth estimators, segmentation models, and pose estimators. Yet it remains an open challenge to realize this vision without relying solely on handcrafted prompting strategies or enforcing fixed, predefined tool pipelines that limit VLMs' ability to discover optimal tool-use patterns. Reinforcement Learning (RL) could close this gap, but has so far been limited to reasoning with a single visual tool because of the large search space in multi-tool reasoning. We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework in which VLMs learn to coordinate multiple tools through interactive exploration and feedback. In the teaching phase, we combine demonstrations from a single-tool specialist trained via interactive RL with traces from a frontier model using all tools. In the exploration phase, the model further refines multi-tool coordination through continued RL. Our model, SpaceTools, with tool-augmented spatial reasoning ability, achieves state-of-the-art performance on spatial understanding benchmarks (RoboSpatial-Home, BLINK, BOP-ASK) and demonstrates reliable real-world manipulation using a 7-DOF robot as a tool. DIRL provides substantial improvements over the vanilla supervised fine-tuning (SFT) (+12% on RoboSpatial) and RL (+16% on RoboSpatial) baselines. Project page: https://spacetools.github.io/.