Recent work on visual representation learning has proven effective for robotic manipulation tasks. However, most existing methods pretrain the visual backbone solely on 2D images or egocentric videos, ignoring the fact that robots act in 3D space, whose structure is hard to learn from 2D observations alone. In this paper, we examine the effectiveness of pretraining the vision backbone on publicly available large-scale 3D data to improve manipulation policy learning. Our method, Depth-aware Pretraining for Robotics (DPR), enables an RGB-only backbone to learn 3D scene representations through self-supervised contrastive learning, in which depth information serves as auxiliary knowledge. No 3D information is required during manipulation policy learning or inference, so the model remains efficient while being effective at manipulation in 3D space. Furthermore, we introduce a new way to inject the robot's proprioception into the policy network that makes the manipulation model more robust and generalizable. Experiments show that the proposed framework improves performance on unseen objects and visual environments across various robotic tasks, on both simulated and real robots.
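The abstract does not state the exact pretraining objective, so the following is only a generic sketch of how a contrastive loss could pair RGB and depth views of the same scene. It assumes a symmetric InfoNCE loss over a batch of paired embeddings; the function name `info_nce`, the `temperature` value, and the embedding shapes are illustrative assumptions, not details taken from DPR.

```python
import numpy as np


def info_nce(rgb_emb, depth_emb, temperature=0.07):
    """Symmetric InfoNCE loss between paired RGB and depth embeddings.

    Hypothetical helper: row i of `rgb_emb` and row i of `depth_emb` are
    assumed to come from the same scene (a positive pair); every other
    row in the batch acts as a negative.
    Both inputs have shape (batch, dim).
    """
    # L2-normalize so the dot product below is a cosine similarity.
    rgb = rgb_emb / np.linalg.norm(rgb_emb, axis=1, keepdims=True)
    depth = depth_emb / np.linalg.norm(depth_emb, axis=1, keepdims=True)

    # Pairwise similarities, scaled by the temperature: shape (batch, batch).
    logits = rgb @ depth.T / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        shifted = scores - scores.max(axis=1, keepdims=True)
        log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        # The positive pair for row i sits on the diagonal.
        return -np.mean(np.diag(log_prob))

    # Average the RGB-to-depth and depth-to-RGB directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


# Toy check with random embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
rgb_batch = rng.normal(size=(8, 16))
depth_aligned = rgb_batch + 0.01 * rng.normal(size=(8, 16))
depth_shuffled = depth_aligned[::-1]  # breaks every RGB/depth pairing

loss_aligned = info_nce(rgb_batch, depth_aligned)
loss_shuffled = info_nce(rgb_batch, depth_shuffled)
```

In a setup like this, the depth encoder would be needed only to compute the pretraining loss. Keeping just the RGB encoder afterwards is consistent with the abstract's statement that no 3D information is required during policy learning or inference.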