Precise robot manipulation requires rich spatial information for imitation learning. Image-based policies model object positions from fixed cameras, making them sensitive to camera view changes. Policies utilizing 3D point clouds usually predict keyframes rather than continuous actions, which poses difficulties in dynamic and contact-rich scenarios. To utilize 3D perception efficiently, we present RISE, an end-to-end baseline for real-world imitation learning, which predicts continuous actions directly from single-view point clouds. It compresses the point cloud into tokens with a sparse 3D encoder. After adding sparse positional encoding, the tokens are featurized using a transformer. Finally, the features are decoded into robot actions by a diffusion head. Trained with 50 demonstrations per real-world task, RISE surpasses representative 2D and 3D policies by a large margin, showing significant advantages in both accuracy and efficiency. Experiments also demonstrate that RISE is more general and robust to environmental changes than previous baselines. Project website: rise-policy.github.io.
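The pipeline described above (single-view point cloud → sparse tokens → positional encoding → transformer → action head) can be sketched schematically. The following is a minimal, illustrative NumPy mock-up of that data flow, not the actual RISE implementation: all function names, shapes, and layer sizes are hypothetical, the sparse 3D encoder is stood in for by simple voxel averaging, and the diffusion head is replaced by a pooled readout for brevity.

```python
import numpy as np

def voxelize(points, voxel_size=0.05):
    """Compress a point cloud to sparse tokens: one mean feature per occupied
    voxel. (Illustrative stand-in for a sparse 3D encoder.)"""
    keys = np.floor(points / voxel_size).astype(int)
    uniq, inv = np.unique(keys, axis=0, return_inverse=True)
    feats = np.zeros((len(uniq), 3))
    counts = np.zeros(len(uniq))
    np.add.at(feats, inv, points)   # accumulate points per voxel
    np.add.at(counts, inv, 1)
    return feats / counts[:, None], uniq  # token features, voxel coordinates

def positional_encoding(coords, dim=16):
    """Sinusoidal encoding of sparse voxel coordinates."""
    freqs = 1.0 / (100 ** (np.arange(dim // 2) / (dim // 2)))
    angles = coords[..., None] * freqs                    # (N, 3, dim/2)
    pe = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return pe.reshape(len(coords), -1)                    # (N, 3*dim)

def self_attention(x):
    """Single-head attention with identity projections (transformer stand-in)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

rng = np.random.default_rng(0)
cloud = rng.uniform(0, 1, size=(2048, 3))        # mock single-view XYZ cloud
feat, coords = voxelize(cloud)                   # sparse tokens
x = np.concatenate([feat, positional_encoding(coords)], axis=1)
x = self_attention(x)                            # featurize tokens
action = x.mean(axis=0)[:7]                      # pooled "action" (e.g. 7-DoF)
print(action.shape)
```

In the real method, the pooled readout above would instead condition an iterative diffusion denoiser that produces a continuous action trajectory.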