Spatial referring is a fundamental capability for embodied robots interacting with the 3D physical world. However, even with powerful pretrained vision-language models (VLMs), recent approaches still struggle to accurately understand complex 3D scenes and dynamically reason about instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM that first achieves precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored to spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and supporting complex reasoning processes (up to 5 steps). In addition, we introduce RefSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial referring with multi-step reasoning. Experiments show that the SFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%. The RFT-trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. Notably, RoboRefer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (e.g., UR5, G1 humanoid) in cluttered real-world scenes. See the project page at https://zhoues.github.io/RoboRefer.