CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining

Leveraging pre-trained 2D image representations in behavior cloning policies has proven highly successful and has become a standard approach for robotic manipulation. However, such representations fail to capture the 3D spatial information about objects and scenes that is essential for precise manipulation. In this work, we introduce Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining (CLAMP), a novel 3D pre-training framework that utilizes point clouds and robot actions. From the merged point cloud computed from RGB-D images and camera extrinsics, we re-render multi-view four-channel image observations with depth and 3D coordinates, including dynamic wrist views, to provide clearer views of target objects for high-precision manipulation tasks. The pre-trained encoders learn to associate the 3D geometric and positional information of objects with robot action patterns via contrastive learning on large-scale simulated robot trajectories. During encoder pre-training, we also pre-train a Diffusion Policy to initialize the policy weights for fine-tuning, which is essential for improving fine-tuning sample efficiency and performance. After pre-training, we fine-tune the policy on a limited number of task demonstrations using the learned image and action representations. We demonstrate that this pre-training and fine-tuning design substantially improves learning efficiency and policy performance on unseen tasks. Furthermore, we show that CLAMP outperforms state-of-the-art baselines across six simulated tasks and five real-world tasks. The project website and videos can be found at https://clamp3d.github.io/CLAMP/.
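
To make the re-rendering step concrete, the following is a minimal sketch of how a four-channel (depth, x, y, z) view might be produced from a merged point cloud. It is an illustration only, not the CLAMP implementation: the function names `unproject` and `render_four_channel`, the pinhole camera model, the nearest-pixel splatting, and the z-buffer rule are all assumptions that the abstract does not specify. Extrinsics are taken to be 4x4 row-major nested lists.

```python
def unproject(depth, fx, fy, cx, cy, cam_to_world):
    """Back-project a depth image (list of rows) into world-frame 3D points.

    Hypothetical helper: assumes a pinhole camera with intrinsics fx, fy, cx, cy.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # skip invalid depth readings
                continue
            # Pixel (u, v) at depth z -> homogeneous camera-frame point.
            pc = ((u - cx) * z / fx, (v - cy) * z / fy, z, 1.0)
            # Apply the camera-to-world extrinsic to obtain world coordinates.
            points.append(tuple(
                sum(cam_to_world[r][k] * pc[k] for k in range(4))
                for r in range(3)
            ))
    return points


def render_four_channel(points, fx, fy, cx, cy, world_to_cam, height, width):
    """Re-render world points into a virtual view as (depth, x, y, z) per pixel.

    Hypothetical helper: empty pixels stay (0, 0, 0, 0); when several points
    land on one pixel, the nearest one wins (a simple z-buffer).
    """
    image = [[(0.0, 0.0, 0.0, 0.0)] * width for _ in range(height)]
    for p in points:
        ph = (p[0], p[1], p[2], 1.0)
        xc, yc, zc = (
            sum(world_to_cam[r][k] * ph[k] for k in range(4)) for r in range(3)
        )
        if zc <= 0:  # point is behind the virtual camera
            continue
        u = round(fx * xc / zc + cx)
        v = round(fy * yc / zc + cy)
        if 0 <= u < width and 0 <= v < height:
            current_depth = image[v][u][0]
            if current_depth == 0.0 or zc < current_depth:
                # Channels: depth in the virtual view, then world-frame x, y, z.
                image[v][u] = (zc, p[0], p[1], p[2])
    return image


# Usage: merge clouds from several cameras by concatenating their points,
# then render the merged cloud from any virtual viewpoint (e.g. a wrist view).
IDENTITY = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]
merged = unproject([[1.0, 1.0], [1.0, 1.0]], 1.0, 1.0, 0.5, 0.5, IDENTITY)
view = render_four_channel(merged, 1.0, 1.0, 0.5, 0.5, IDENTITY, 2, 2)
```

Under these assumptions, a dynamic wrist view would correspond to passing a `world_to_cam` matrix that changes with the end-effector pose at each time step, while the merged point cloud itself stays in the world frame.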
