3D policy learning promises superior generalization and cross-embodiment transfer, but progress has been hindered by training instabilities and severe overfitting, precluding the adoption of powerful 3D perception models. In this work, we systematically diagnose these failures, identifying the omission of 3D data augmentation and the adverse effects of Batch Normalization as primary causes. We propose a new architecture coupling a scalable transformer-based 3D encoder with a diffusion decoder, engineered specifically for stability at scale and designed to leverage large-scale pre-training. Our approach significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks, establishing a new and robust foundation for scalable 3D imitation learning. Project Page: https://r3d-policy.github.io/