Goal-Conditioned Imitation Learning using Score-based Diffusion Policies

We propose a new policy representation based on score-based diffusion models (SDMs). We apply our new policy representation in the domain of Goal-Conditioned Imitation Learning (GCIL) to learn general-purpose goal-specified policies from large uncurated datasets without rewards. Our new goal-conditioned policy architecture "$\textbf{BE}$havior generation with $\textbf{S}$c$\textbf{O}$re-based Diffusion Policies" (BESO) leverages a generative, score-based diffusion model as its policy. BESO decouples the learning of the score model from the inference sampling process, and, hence allows for fast sampling strategies to generate goal-specified behavior in just 3 denoising steps, compared to 30+ steps of other diffusion based policies. Furthermore, BESO is highly expressive and can effectively capture multi-modality present in the solution space of the play data. Unlike previous methods such as Latent Plans or C-Bet, BESO does not rely on complex hierarchical policies or additional clustering for effective goal-conditioned behavior learning. Finally, we show how BESO can even be used to learn a goal-independent policy from play-data using classifier-free guidance. To the best of our knowledge this is the first work that a) represents a behavior policy based on such a decoupled SDM b) learns an SDM based policy in the domain of GCIL and c) provides a way to simultaneously learn a goal-dependent and a goal-independent policy from play-data. We evaluate BESO through detailed simulation and show that it consistently outperforms several state-of-the-art goal-conditioned imitation learning methods on challenging benchmarks. We additionally provide extensive ablation studies and experiments to demonstrate the effectiveness of our method for effective goal-conditioned behavior generation.

翻译：我们提出了一种基于分数扩散模型（SDMs）的新型策略表示方法。我们将该策略表示应用于目标条件模仿学习（GCIL）领域，以从大规模未整理数据集中学习通用目标指定策略，且无需奖励函数。我们提出的新型目标条件策略架构——基于分数扩散策略的行为生成（BESO）——利用生成式分数扩散模型作为其策略核心。BESO将分数模型的学习过程与推理采样过程解耦，从而支持快速采样策略，仅需3步去噪即可生成目标指定行为，而其他基于扩散的策略通常需要30步以上。此外，BESO具有高度表达能力，可有效捕捉游戏数据解空间中的多模态性。与Latent Plans或C-Bet等先前方法不同，BESO无需依赖复杂分层策略或额外聚类即可实现有效的目标条件行为学习。最后，我们展示了BESO如何通过无分类器引导从游戏数据中学习目标无关策略。据我们所知，这是首次实现以下成果的工作：a) 基于这种解耦SDM表示行为策略，b) 在GCIL领域学习基于SDM的策略，c) 提供从游戏数据中同时学习目标相关与目标无关策略的方法。我们通过详细的仿真实验评估BESO，结果表明其在多个具有挑战性的基准测试中持续优于多种最先进的目标条件模仿学习方法。此外，我们还提供了全面的消融实验与对比实验，以证明该方法在高效生成目标条件行为方面的有效性。

相关内容

SDM

关注 11

数据挖掘是从数据中发现有价值的知识的计算过程，是现代数据科学的核心。它在许多领域有着巨大的应用，包括科学、工程、医疗保健、商业和医学。这些字段中的典型数据集是大的、复杂的，而且通常是有噪声的。从这些数据集中提取知识需要使用复杂的、高性能的、有原则的分析技术和算法。这些技术反过来又需要在高性能计算基础设施上的实现，这些基础设施需要经过仔细的性能调优。强大的可视化技术和有效的用户界面对于使数据挖掘工具吸引来自不同学科的研究人员、分析师、数据科学家和应用程序开发人员以及利益相关者的可用性也至关重要。SDM确立了自己在数据挖掘领域的领先地位，并为解决这些问题的研究人员提供了一个在同行评审论坛上展示其工作的场所。SDM强调原则方法和坚实的数学基础，以其高质量和高影响力的技术论文而闻名，并提供强大的研讨会和教程程序(包括在会议注册中)。官网地址：http://dblp.uni-trier.de/db/conf/sdm/

DiffRec: 扩散推荐模型（SIGIR'23）

专知会员服务

48+阅读 · 2023年4月16日

【“大量”智能体的强化学习】《Many-Agent Reinforcement Learning》，327页博士论文，伦敦大学学院（UCL）

专知会员服务

119+阅读 · 2022年5月7日

【MIla】一种意识启发规划的基于模型强化学习，A Consciousness-Inspired Planning Agent for Model-Based Reinforcement Learning

专知会员服务

24+阅读 · 2022年3月19日

【伯克利JD Co-Reyes博士论文】建立强化学习算法泛化:从潜在动力学模型到元学习，Building Reinforcement Learning Algorithms that Generalize: From Latent Dynamics Models to Meta-Learning

专知会员服务

45+阅读 · 2022年3月6日