Vision-Language Models (VLMs) have shown remarkable capabilities in spatial reasoning, yet they remain fundamentally limited to qualitative spatial estimates and lack the quantitative precision required for real-world robotics. Current approaches fail to leverage metric cues from depth sensors and camera calibration, instead reducing geometric problems to pattern-recognition tasks that cannot deliver the centimeter-level accuracy essential for robotic manipulation. We present TIGeR (Tool-Integrated Geometric Reasoning), a novel framework that transforms VLMs from perceptual estimators into geometric computers by enabling them to generate and execute precise geometric computations through external tools. Rather than attempting to internalize complex geometric operations within neural networks, TIGeR equips models to recognize when geometric reasoning is required, synthesize appropriate computational code, and invoke specialized libraries for exact calculations. To support this paradigm, we introduce TIGeR-300K, a comprehensive tool-invocation-oriented dataset covering point transformations, pose estimation, and spatial compatibility verification, complete with tool-invocation sequences and intermediate computations. Through a two-stage training pipeline combining supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) with our proposed hierarchical reward design, TIGeR achieves state-of-the-art (SOTA) performance on geometric reasoning benchmarks while demonstrating centimeter-level precision in real-world robotic manipulation tasks.
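To make the tool-invocation paradigm concrete, the sketch below illustrates the kind of exact point transformation a model could delegate to an external tool rather than estimate perceptually: back-projecting a pixel with sensor depth through pinhole camera intrinsics, then mapping the result into the world frame. This is a minimal sketch assuming a standard pinhole model; the function names and numeric values are illustrative, not TIGeR's actual tool API.

```python
import numpy as np

def backproject_pixel(u, v, depth, K):
    """Lift a pixel (u, v) with metric depth (meters) to a 3D point
    in the camera frame, assuming a pinhole model with intrinsics K."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

def camera_to_world(p_cam, T_world_cam):
    """Transform a camera-frame point to the world frame using a
    4x4 homogeneous extrinsic matrix T_world_cam."""
    p_h = np.append(p_cam, 1.0)  # homogeneous coordinates
    return (T_world_cam @ p_h)[:3]

# Hypothetical example: intrinsics for a 640x480 camera, identity extrinsics.
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])
T_world_cam = np.eye(4)
p_world = camera_to_world(backproject_pixel(350, 220, 0.75, K), T_world_cam)
print(p_world)  # exact metric coordinates, not a perceptual estimate
```

Computations of this form are deterministic given calibrated intrinsics and a depth reading, which is the property that lets tool invocation deliver the centimeter-level accuracy the abstract contrasts with pattern-recognition-style estimation.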