We propose a new policy representation based on score-based diffusion models (SDMs) and apply it to Goal-Conditioned Imitation Learning (GCIL), where the aim is to learn general-purpose, goal-specified policies from large, uncurated datasets without rewards. Our goal-conditioned policy architecture, "$\textbf{BE}$havior generation with $\textbf{S}$c$\textbf{O}$re-based Diffusion Policies" (BESO), uses a generative score-based diffusion model as its policy. BESO decouples the learning of the score model from the inference-time sampling process and hence allows fast sampling strategies that generate goal-specified behavior in just 3 denoising steps, compared with the 30+ steps of other diffusion-based policies. Furthermore, BESO is highly expressive and can effectively capture the multi-modality present in the solution space of the play data. Unlike previous methods such as Latent Plans or C-Bet, BESO does not rely on complex hierarchical policies or additional clustering for effective goal-conditioned behavior learning. Finally, we show that BESO can also learn a goal-independent policy from play data using classifier-free guidance. To the best of our knowledge, this is the first work that a) represents a behavior policy with such a decoupled SDM, b) learns an SDM-based policy in the domain of GCIL, and c) provides a way to simultaneously learn a goal-dependent and a goal-independent policy from play data. We evaluate BESO in detailed simulation experiments and show that it consistently outperforms several state-of-the-art goal-conditioned imitation learning methods on challenging benchmarks. We additionally provide extensive ablation studies and experiments to demonstrate the effectiveness of our method for goal-conditioned behavior generation.
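To make the two mechanisms named above more concrete, the following is a minimal sketch of few-step sampling combined with classifier-free guidance. It is not the authors' implementation: the Karras-style noise schedule, its constants, the plain Euler update, the guidance formula, and every name (`sigma_schedule`, `toy_denoiser`, `sample_action`) are assumptions made for illustration. The denoiser is a hypothetical stand-in for a trained network and simply returns a fixed target, which corresponds to a point-mass action distribution.

```python
import numpy as np


def sigma_schedule(n_steps, sigma_min=0.01, sigma_max=1.0, rho=7.0):
    """Assumed Karras-style noise levels from sigma_max to sigma_min, then 0."""
    ramp = np.linspace(0.0, 1.0, n_steps)
    hi = sigma_max ** (1.0 / rho)
    lo = sigma_min ** (1.0 / rho)
    sigmas = (hi + ramp * (lo - hi)) ** rho
    return np.append(sigmas, 0.0)


def toy_denoiser(action, sigma, goal):
    """Hypothetical stand-in for a trained denoiser D(action, sigma, goal).

    goal=None plays the role of the goal-independent (unconditional) branch.
    This toy ignores the noisy action and the noise level and returns a fixed
    target, i.e. it models a point-mass action distribution.
    """
    return np.zeros_like(action) if goal is None else np.asarray(goal, dtype=float)


def sample_action(denoiser, goal, n_steps=3, guidance=1.0, action_dim=2, seed=0):
    """Few-step Euler sampler with classifier-free guidance (illustrative)."""
    rng = np.random.default_rng(seed)
    sigmas = sigma_schedule(n_steps)
    # Start from pure noise at the highest noise level.
    action = rng.standard_normal(action_dim) * sigmas[0]
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        d_cond = denoiser(action, sigma, goal)
        d_uncond = denoiser(action, sigma, None)
        # guidance=0 gives the goal-independent policy, guidance=1 the
        # goal-conditioned one; values above 1 extrapolate toward the goal.
        denoised = d_uncond + guidance * (d_cond - d_uncond)
        slope = (action - denoised) / sigma
        action = action + slope * (sigma_next - sigma)
    return action


if __name__ == "__main__":
    goal = np.array([0.5, -0.25])
    print(sample_action(toy_denoiser, goal, n_steps=3, guidance=1.0))
```

Because the sampler only calls the denoiser as a function, the number of steps and the update rule can be changed at inference time without retraining, which is the sense in which training and sampling are decoupled in this sketch.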