3D spatial perception is fundamental to generalizable robotic manipulation, yet obtaining reliable, high-quality 3D geometry remains challenging. Depth sensors suffer from noise and material sensitivity, while existing reconstruction models lack the precision and metric consistency required for physical interaction. We introduce Robo3R, a feed-forward, manipulation-ready 3D reconstruction model that predicts accurate, metric-scale scene geometry directly from RGB images and robot states in real time. Robo3R jointly infers scale-invariant local geometry and relative camera poses, which are unified into the scene representation in the canonical robot frame via a learned global similarity transformation. To meet the precision demands of manipulation, Robo3R employs a masked point head for sharp, fine-grained point clouds, and a keypoint-based Perspective-n-Point (PnP) formulation to refine camera extrinsics and global alignment. Trained on Robo3R-4M, a curated large-scale synthetic dataset with four million high-fidelity annotated frames, Robo3R consistently outperforms state-of-the-art reconstruction methods and depth sensors. Across downstream tasks including imitation learning, sim-to-real transfer, grasp synthesis, and collision-free motion planning, we observe consistent gains in performance, suggesting the promise of this alternative 3D sensing module for robotic manipulation.
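As an illustration of the unification step described above, the following is a minimal sketch, not the authors' implementation, of how per-view local point maps could be merged through relative camera poses and then mapped into the robot frame by a global similarity transformation. The function name `to_robot_frame` and all parameter names are hypothetical. In Robo3R the poses and the similarity transformation are predicted by the model, whereas here they are simply passed in as given values.

```python
import numpy as np

def to_robot_frame(local_points, cam_to_ref, scale, R_global, t_global):
    """Merge per-view points and map them into the robot frame.

    local_points: list of (N_i, 3) arrays, points in each camera's frame
                  (scale-invariant, i.e. known only up to a global scale).
    cam_to_ref:   list of (4, 4) homogeneous transforms taking each camera
                  frame to a shared reference frame (relative poses).
    scale, R_global, t_global: similarity transformation (scalar, 3x3
                  rotation, 3-vector) from the reference frame to the
                  robot base frame. Assumed given in this sketch.
    """
    merged = []
    for pts, T in zip(local_points, cam_to_ref):
        # Rigidly move this view's points into the shared reference frame.
        merged.append(pts @ T[:3, :3].T + T[:3, 3])
    merged = np.concatenate(merged, axis=0)
    # Similarity transformation: x_robot = s * R * x_ref + t.
    return scale * (merged @ R_global.T) + t_global
```

Applying the scale only once, after all views share a reference frame, keeps the local geometry and relative poses mutually consistent. A single scalar then fixes the metric scale of the whole scene.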