Leveraging pre-trained 2D image representations in behavior-cloning policies has achieved great success and become a standard approach in robotic manipulation. However, such representations fail to capture the 3D spatial information about objects and scenes that is essential for precise manipulation. In this work, we introduce Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining (CLAMP), a novel 3D pre-training framework that leverages point clouds and robot actions. From a merged point cloud computed from RGB-D images and camera extrinsics, we re-render multi-view four-channel image observations encoding depth and 3D coordinates, including dynamic wrist views that provide clearer views of target objects for high-precision manipulation tasks. The pre-trained encoders learn to associate the 3D geometric and positional information of objects with robot action patterns via contrastive learning on large-scale simulated robot trajectories. Alongside encoder pre-training, we pre-train a Diffusion Policy to initialize the policy weights for fine-tuning, which is essential for improving fine-tuning sample efficiency and performance. After pre-training, we fine-tune the policy on a limited number of task demonstrations using the learned image and action representations. We demonstrate that this pre-training and fine-tuning design substantially improves learning efficiency and policy performance on unseen tasks. Furthermore, we show that CLAMP outperforms state-of-the-art baselines across six simulated tasks and five real-world tasks.