Vision-Language Models (VLMs) have shown remarkable capabilities in spatial reasoning, yet they remain fundamentally limited to qualitative spatial estimates and lack the quantitative precision required for real-world robotics. Current approaches fail to leverage metric cues from depth sensors and camera calibration, instead reducing geometric problems to pattern-recognition tasks that cannot deliver the centimeter-level accuracy essential for robotic manipulation. We present TIGeR (Tool-Integrated Geometric Reasoning), a novel framework that transforms VLMs from perceptual estimators into geometric computers by enabling them to generate and execute precise geometric computations through external tools. Rather than attempting to internalize complex geometric operations within neural networks, TIGeR equips models to recognize when geometric reasoning is required, synthesize appropriate computational code, and invoke specialized libraries for exact calculations. To support this paradigm, we introduce TIGeR-300K, a comprehensive tool-invocation-oriented dataset covering point transformations, pose estimation, and spatial compatibility verification, complete with tool-invocation sequences and intermediate computations. Through a two-stage training pipeline combining supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) with our proposed hierarchical reward design, TIGeR achieves state-of-the-art (SOTA) performance on geometric reasoning benchmarks while demonstrating centimeter-level precision in real-world robotic manipulation tasks.
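To make the tool-invocation paradigm concrete, the sketch below illustrates the kind of exact point transformation a model could delegate to an external tool rather than estimate perceptually: back-projecting a pixel with sensor depth through pinhole camera intrinsics, then mapping the result into the world frame. This is a minimal sketch assuming a standard pinhole model; the function names and numeric values are illustrative, not TIGeR's actual tool API.

```python
import numpy as np

def backproject_pixel(u, v, depth, K):
    """Lift a pixel (u, v) with metric depth (meters) to a 3D point
    in the camera frame, assuming a pinhole model with intrinsics K."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

def camera_to_world(p_cam, T_world_cam):
    """Transform a camera-frame point to the world frame using a
    4x4 homogeneous extrinsic matrix T_world_cam."""
    p_h = np.append(p_cam, 1.0)  # homogeneous coordinates
    return (T_world_cam @ p_h)[:3]

# Hypothetical example: intrinsics for a 640x480 camera, identity extrinsics.
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])
T_world_cam = np.eye(4)
p_world = camera_to_world(backproject_pixel(350, 220, 0.75, K), T_world_cam)
print(p_world)  # exact metric coordinates, not a perceptual estimate
```

Computations of this form are deterministic given calibrated intrinsics and a depth reading, which is the property that lets tool invocation deliver the centimeter-level accuracy the abstract contrasts with pattern-recognition-style estimation.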