Imitation learning provides an efficient way to teach robots dexterous skills; however, learning complex skills robustly and generalizably usually requires large amounts of human demonstrations. To tackle this challenging problem, we present 3D Diffusion Policy (DP3), a novel visual imitation learning approach that incorporates the power of 3D visual representations into diffusion policies, a class of conditional action generative models. The core design of DP3 is the use of a compact 3D visual representation, extracted from sparse point clouds with an efficient point encoder. In our experiments involving 72 simulation tasks, DP3 successfully handles most tasks with just 10 demonstrations and surpasses baselines with a 24.2% relative improvement. In 4 real robot tasks, DP3 demonstrates precise control with a high success rate of 85%, given only 40 demonstrations per task, and shows excellent generalization across diverse aspects, including space, viewpoint, appearance, and instance. Interestingly, in real robot experiments, DP3 rarely violates safety requirements, in contrast to baseline methods, which frequently do and thus require human intervention. Our extensive evaluation highlights the critical importance of 3D representations in real-world robot learning. Videos, code, and data are available at https://3d-diffusion-policy.github.io .
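To make the "compact 3D visual representation" concrete, the following is a minimal NumPy sketch of a PointNet-style encoder: a per-point MLP followed by order-invariant max pooling, producing one compact feature vector that could condition a diffusion policy. The layer sizes, point count, and weight initialization are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class PointEncoder:
    """Sketch of a compact point-cloud encoder: per-point MLP + max pooling.

    Hypothetical dimensions; DP3's actual encoder may differ.
    """

    def __init__(self, in_dim=3, hidden=64, out_dim=64, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((in_dim, hidden)) * 0.1
        self.w2 = rng.standard_normal((hidden, out_dim)) * 0.1

    def __call__(self, points):
        # points: (N, 3) sparse point cloud, e.g. N=512 after downsampling
        h = relu(points @ self.w1)   # per-point features: (N, hidden)
        h = h @ self.w2              # (N, out_dim)
        # Max pooling over points gives a single compact vector that is
        # invariant to point ordering -- the conditioning input for the policy.
        return h.max(axis=0)         # (out_dim,)

enc = PointEncoder()
cloud = np.random.default_rng(1).uniform(-1.0, 1.0, size=(512, 3))
feat = enc(cloud)
print(feat.shape)  # (64,)
```

Because the pooling is a symmetric function, shuffling the input points leaves the feature vector unchanged, which is what makes such encoders well suited to unordered sparse point clouds.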