3D Diffuser Actor: Policy Diffusion with 3D Scene Representations

We marry diffusion policies and 3D scene representations for robot manipulation. Diffusion policies learn the action distribution conditioned on the robot and environment state using conditional diffusion models. They have recently been shown to outperform both deterministic and alternative state-conditioned action distribution learning methods. 3D robot policies use 3D scene feature representations aggregated from one or more camera views using sensed depth. They have been shown to generalize better than their 2D counterparts across camera viewpoints. We unify these two lines of work and present 3D Diffuser Actor, a neural policy architecture that, given a language instruction, builds a 3D representation of the visual scene and conditions on it to iteratively denoise 3D rotations and translations for the robot's end-effector. At each denoising iteration, our model represents end-effector pose estimates as 3D scene tokens and predicts the 3D translation and rotation error for each of them by featurizing them using 3D relative attention to other 3D visual and language tokens. 3D Diffuser Actor sets a new state of the art on RLBench, with an absolute performance gain of 16.3% over the current SOTA in a multi-view setup and an absolute gain of 13.1% in a single-view setup. On the CALVIN benchmark, it outperforms the current SOTA in the zero-shot unseen-scene generalization setting by successfully completing 0.2 more tasks, a 7% relative increase. It also works in the real world from a handful of demonstrations. We ablate our model's architectural design choices, such as 3D scene featurization and 3D relative attention, and show that they all help generalization. Our results suggest that 3D scene representations and powerful generative modeling are key to efficient robot learning from demonstrations.
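
To make the iterative denoising described above concrete, the following is a minimal sketch of a standard DDPM-style reverse loop applied to a 3D end-effector translation. It is not the architecture presented here: the linear noise schedule, the step count, the function names, and the target position are all assumptions made for illustration, rotation denoising is omitted, and `toy_noise_predictor` is a hypothetical oracle standing in for the learned network that would predict the error from 3D scene and language tokens.

```python
import numpy as np


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an assumed choice for this sketch)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_noise_predictor(noisy_pos, step, target_pos, alpha_bars):
    """Hypothetical stand-in for the learned denoising network.

    A trained model would predict the noise from scene and language
    features; this oracle computes it from a known target so the loop
    can be run without any training.
    """
    a_bar = alpha_bars[step]
    return (noisy_pos - np.sqrt(a_bar) * target_pos) / np.sqrt(1.0 - a_bar)


def denoise_translation(target_pos, num_steps=50, seed=0):
    """Run DDPM ancestral sampling from Gaussian noise to a 3D position."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    pos = rng.standard_normal(3)  # start from pure noise
    for step in reversed(range(num_steps)):
        eps_hat = toy_noise_predictor(pos, step, target_pos, alpha_bars)
        # Posterior mean of the previous, less noisy estimate.
        scale = betas[step] / np.sqrt(1.0 - alpha_bars[step])
        pos = (pos - scale * eps_hat) / np.sqrt(alphas[step])
        if step > 0:
            # Re-inject noise on every step except the last.
            pos = pos + np.sqrt(betas[step]) * rng.standard_normal(3)
    return pos


target = np.array([0.4, -0.1, 0.25])  # made-up end-effector position
estimate = denoise_translation(target)
print(np.round(estimate, 4))
```

Because the oracle predicts the noise exactly, the final step recovers the target up to floating-point error. With a learned predictor, the same loop would instead draw a sample from the action distribution conditioned on the scene and the instruction.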
