Behavior Cloning (BC) methods are effective at learning complex manipulation tasks. However, they are prone to spurious correlations: expressive models may attend to distractors that are irrelevant to action prediction, which makes them fragile in real-world deployment. Prior methods have addressed this challenge by exploring different model architectures and action representations, but none balances sample efficiency with robustness to distractors on manipulation tasks with a complex action space. We present the \textbf{C}onstrained-\textbf{C}ontext \textbf{C}onditional \textbf{D}iffusion \textbf{M}odel (C3DM), a diffusion model policy for 6-DoF robotic manipulation that is robust to distractors and can learn deployable robot policies from as few as five demonstrations. A key component of C3DM is a fixation step that helps the action denoiser focus on task-relevant regions around a predicted fixation point while ignoring distractors in the context. We empirically show that C3DM is robust to out-of-distribution distractors and consistently achieves high success rates on a wide array of tasks, ranging from table-top manipulation to industrial kitting, that require varying levels of precision and robustness to distractors.