Spatial referring is a fundamental capability for embodied robots to interact with the 3D physical world. However, even with powerful pretrained vision-language models (VLMs), recent approaches still struggle to accurately understand complex 3D scenes and to dynamically reason about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware VLM that first achieves precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored for spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and supporting complex reasoning processes (up to 5 steps). In addition, we introduce RefSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial referring with multi-step reasoning. Experiments show that SFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%. RFT-trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini-2.5-Pro by 17.4% in average accuracy on RefSpatial-Bench. Notably, RoboRefer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (e.g., UR5, G1 humanoid) in cluttered real-world scenes.