A prominent approach to visual Reinforcement Learning (RL) is to learn an internal state representation using self-supervised methods, which can potentially improve sample efficiency and generalization through additional learning signal and inductive biases. However, while the real world is inherently 3D, prior efforts have largely focused on leveraging 2D computer vision techniques as auxiliary self-supervision. In this work, we present a unified framework for self-supervised learning of 3D representations for motor control. Our proposed framework consists of two phases: a pretraining phase, in which a deep voxel-based 3D autoencoder is pretrained on a large object-centric dataset, and a finetuning phase, in which the representation is finetuned jointly with RL on in-domain data. We empirically show that our method is more sample-efficient than 2D representation learning methods in simulated manipulation tasks. Additionally, our learned policies transfer zero-shot to a real robot setup with only approximate geometric correspondence, and successfully solve motor control tasks that involve grasping and lifting using input from a single, uncalibrated RGB camera. Code and videos are available at https://yanjieze.com/3d4rl/.
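The two-phase recipe described in the abstract (pretrain a 3D autoencoder, then finetune the representation jointly with the control objective) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not the authors' implementation: a tiny linear autoencoder over flattened 2x2x2 occupancy grids stands in for the deep voxel-based 3D autoencoder, a scalar regression head stands in for the RL objective, the weighting `lam` between the two losses is hypothetical, and all function names are invented.

```python
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]

def init(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def sgd(W, G, lr):
    # In-place gradient step on a matrix stored as a list of rows.
    for row, grow in zip(W, G):
        for j, g in enumerate(grow):
            row[j] -= lr * g

def recon_grads(E, D, x):
    """Squared reconstruction error of a linear autoencoder, plus gradients."""
    z = matvec(E, x)                               # latent code
    r = [a - b for a, b in zip(matvec(D, z), x)]   # reconstruction residual
    loss = sum(v * v for v in r)
    gD = [[2 * ri * zj for zj in z] for ri in r]   # d loss / d D
    gz = [2 * sum(D[i][j] * r[i] for i in range(len(r)))
          for j in range(len(z))]                  # d loss / d z
    return loss, z, gz, gD

def mean_recon(E, D, voxels):
    return sum(recon_grads(E, D, x)[0] for x in voxels) / len(voxels)

def pretrain(E, D, voxels, epochs=100, lr=0.02):
    """Phase 1: reconstruction-only training on an object dataset."""
    for _ in range(epochs):
        for x in voxels:
            _, _, gz, gD = recon_grads(E, D, x)
            sgd(D, gD, lr)
            sgd(E, outer(gz, x), lr)

def finetune(E, D, w, batch, lam=0.5, steps=50, lr=0.02):
    """Phase 2: joint update on a task loss plus lam * reconstruction loss.

    The task loss (w . z - target)^2 is only a stand-in for an RL objective.
    """
    for _ in range(steps):
        for x, target in batch:
            _, z, gz, gD = recon_grads(E, D, x)
            err = sum(wi * zi for wi, zi in zip(w, z)) - target
            # Both losses send gradients into the shared encoder E.
            gz_joint = [2 * err * wi + lam * g for wi, g in zip(w, gz)]
            gw = [2 * err * zi for zi in z]
            sgd(D, [[lam * g for g in row] for row in gD], lr)
            sgd(E, outer(gz_joint, x), lr)
            sgd([w], [gw], lr)

rng = random.Random(0)
n, k = 8, 3  # 2x2x2 occupancy grid flattened to 8 values, 3-d latent
object_voxels = [[float(rng.random() < 0.5) for _ in range(n)]
                 for _ in range(32)]
E, D = init(k, n, rng), init(n, k, rng)

before = mean_recon(E, D, object_voxels)
pretrain(E, D, object_voxels)
after = mean_recon(E, D, object_voxels)

# Hypothetical "in-domain" data: a few grids paired with a scalar target.
in_domain = [(x, sum(x) / n) for x in object_voxels[:8]]
w = [0.0] * k
finetune(E, D, w, in_domain)
joint_recon = mean_recon(E, D, [x for x, _ in in_domain])
```

The point of the sketch is the structure rather than the model: the encoder is first shaped by reconstruction alone, and in the second phase the same encoder receives gradients from both the reconstruction term and the task term.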