Learning Geometrically-Grounded 3D Visual Representations for View-Generalizable Robotic Manipulation

Real-world robotic manipulation demands visuomotor policies capable of robust spatial scene understanding and strong generalization across diverse camera viewpoints. While recent advances in 3D-aware visual representations have shown promise, they still suffer from several key limitations, including reliance on multi-view observations during inference, which is impractical in scenarios restricted to a single view; incomplete scene modeling that fails to capture the holistic and fine-grained geometric structures essential for precise manipulation; and a lack of effective policy training strategies to retain and exploit the acquired 3D knowledge. To address these challenges, we present MethodName, a unified representation-policy learning framework for view-generalizable robotic manipulation. MethodName introduces a single-view 3D pretraining paradigm that leverages point cloud reconstruction and feed-forward Gaussian splatting under multi-view supervision to learn holistic geometric representations. During policy learning, MethodName performs multi-step distillation to preserve the pretrained geometric understanding and effectively transfer it to manipulation skills. We conduct experiments on 12 RLBench tasks, where our approach outperforms the previous state-of-the-art method by 12.7% in average success rate. Further evaluation on six representative tasks demonstrates strong zero-shot view generalization, with success rate drops of only 22.0% and 29.7% under moderate and large viewpoint shifts, respectively, whereas the state-of-the-art method suffers larger drops of 41.6% and 51.5%.
