Spatial referring is a fundamental capability for embodied robots interacting with the 3D physical world. However, even with powerful pretrained vision-language models (VLMs), recent approaches still struggle to accurately understand complex 3D scenes and dynamically reason about instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM that first achieves precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored to spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and supporting complex reasoning processes (up to 5 steps). In addition, we introduce RefSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial referring with multi-step reasoning. Experiments show that the SFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%. The RFT-trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. Notably, RoboRefer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (e.g., UR5, G1 humanoid) in cluttered real-world scenes. See the project page at https://zhoues.github.io/RoboRefer.