Goal-Conditioned Imitation Learning using Score-based Diffusion Policies

We propose a new policy representation based on score-based diffusion models (SDMs). We apply this representation to Goal-Conditioned Imitation Learning (GCIL) to learn general-purpose, goal-specified policies from large, uncurated datasets without rewards. Our new goal-conditioned policy architecture, "$\textbf{BE}$havior generation with $\textbf{S}$c$\textbf{O}$re-based Diffusion Policies" (BESO), uses a generative, score-based diffusion model as its policy. BESO decouples the learning of the score model from the inference sampling process and hence allows fast sampling strategies that generate goal-specified behavior in just 3 denoising steps, compared to the 30+ steps of other diffusion-based policies. Furthermore, BESO is highly expressive and can effectively capture the multi-modality present in the solution space of the play data. Unlike previous methods such as Latent Plans or C-Bet, BESO does not rely on complex hierarchical policies or additional clustering for effective goal-conditioned behavior learning. Finally, we show how BESO can also be used to learn a goal-independent policy from play data using classifier-free guidance. To the best of our knowledge, this is the first work that a) represents a behavior policy based on such a decoupled SDM, b) learns an SDM-based policy in the domain of GCIL, and c) provides a way to simultaneously learn a goal-dependent and a goal-independent policy from play data. We evaluate BESO in detailed simulation experiments and show that it consistently outperforms several state-of-the-art goal-conditioned imitation learning methods on challenging benchmarks. We additionally provide extensive ablation studies and experiments to demonstrate the effectiveness of our method for goal-conditioned behavior generation. Demonstrations and code are available at https://intuitive-robots.github.io/beso-website/
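To make the decoupling of training and sampling concrete, the following is a minimal sketch of few-step sampling with classifier-free guidance. It is not the BESO implementation: `toy_denoiser` is a hypothetical closed-form stand-in for a trained goal-conditioned network, the noise schedule and Euler update are one common choice of sampler assumed here for illustration, and all names and default values (`DATA_STD`, `noise_schedule`, `sample_action`, `guidance`) are assumptions.

```python
import numpy as np

DATA_STD = 0.05  # assumed spread of demonstrated actions around their mean


def toy_denoiser(x, sigma, goal=None):
    """Hypothetical stand-in for a learned denoiser D(x, sigma, goal).

    Closed-form optimum for Gaussian action data centred on the goal
    (or on zero when no goal is given). A real policy would be a network.
    """
    mean = np.zeros_like(x) if goal is None else goal
    return (DATA_STD**2 * x + sigma**2 * mean) / (DATA_STD**2 + sigma**2)


def noise_schedule(n_steps, sigma_min=0.01, sigma_max=1.0, rho=7.0):
    """Decreasing noise levels with a final 0 appended (assumed schedule)."""
    ramp = np.linspace(0.0, 1.0, n_steps)
    inv_rho = 1.0 / rho
    sigmas = (sigma_max**inv_rho
              + ramp * (sigma_min**inv_rho - sigma_max**inv_rho)) ** rho
    return np.append(sigmas, 0.0)


def sample_action(goal, n_steps=3, guidance=1.0, n_samples=1, seed=0):
    """Draw actions with n_steps deterministic Euler denoising steps.

    The step count is chosen here, at inference time, independently of
    how the denoiser was obtained.
    """
    rng = np.random.default_rng(seed)
    sigmas = noise_schedule(n_steps)
    # Start from pure noise at the largest noise level.
    x = sigmas[0] * rng.standard_normal((n_samples, goal.shape[-1]))
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        d_uncond = toy_denoiser(x, sigma)        # goal-independent prediction
        d_cond = toy_denoiser(x, sigma, goal)    # goal-conditioned prediction
        # Classifier-free guidance: blend the two predictions.
        denoised = d_uncond + guidance * (d_cond - d_uncond)
        # Euler step along the direction (x - denoised) / sigma.
        x = x + (sigma_next - sigma) * (x - denoised) / sigma
    return x


goal = np.array([0.5, -0.3])
actions = sample_action(goal, n_steps=3, n_samples=4)
```

In this sketch, `guidance=1.0` uses only the goal-conditioned prediction and `guidance=0.0` only the goal-independent one, so a single denoiser serves both roles. Changing `n_steps` requires no change to the denoiser, which is the sense in which sampling is decoupled from training.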
