Imitation learning addresses the challenge of learning by observing an expert's demonstrations without access to reward signals from the environment. Most existing imitation learning methods that do not require interacting with the environment model the expert distribution either as the conditional probability p(a|s) (e.g., behavioral cloning, BC) or as the joint probability p(s, a). Although modeling the conditional probability with BC is simple, it usually generalizes poorly. Modeling the joint probability can improve generalization, but inference is often time-consuming and the model can suffer from manifold overfitting. This work proposes an imitation learning framework that benefits from modeling both the conditional and the joint probabilities of the expert distribution. Our proposed Diffusion Model-Augmented Behavioral Cloning (DBC) employs a diffusion model trained to model expert behaviors and learns a policy that optimizes both the BC loss (conditional) and our proposed diffusion model loss (joint). DBC outperforms baselines on a variety of continuous control tasks in navigation, robot arm manipulation, dexterous manipulation, and locomotion. We design additional experiments to verify the limitations of modeling only the conditional probability or only the joint probability of the expert distribution, and to compare different generative models. Ablation studies justify the effectiveness of our design choices.
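The abstract does not give the loss formulas, so the following is only an illustrative sketch of how a policy objective might combine a BC term (conditional) with a diffusion-style denoising term on the state-action pair (joint). Everything named here is an assumption made for illustration and is not taken from the paper: the function names, the weight `lam`, the uniform sampling of a noise level `alpha_bar`, and the analytic `toy_noise_predictor`, which stands in for a trained, frozen diffusion model whose "expert distribution" is a single state-action pair.

```python
import math
import random

# Hypothetical single expert demonstration (state, action).
EXPERT_STATE = [0.5, -0.2]
EXPERT_ACTION = [1.0]


def bc_loss(pred_action, expert_action):
    """Mean squared error between the policy's action and the expert's action."""
    return sum((p - e) ** 2 for p, e in zip(pred_action, expert_action)) / len(expert_action)


def toy_noise_predictor(x_t, alpha_bar):
    """Stand-in for a trained noise predictor (assumption, not a learned model).

    For an expert distribution concentrated on one (s, a) pair, the noise
    that produced x_t can be recovered in closed form.
    """
    x_exp = EXPERT_STATE + EXPERT_ACTION
    return [
        (x - math.sqrt(alpha_bar) * m) / math.sqrt(1.0 - alpha_bar)
        for x, m in zip(x_t, x_exp)
    ]


def diffusion_loss(noise_predictor, state, action, n_samples=64, seed=0):
    """Monte Carlo estimate of a denoising loss on the concatenated pair (s, a)."""
    rng = random.Random(seed)
    x0 = list(state) + list(action)
    total = 0.0
    for _ in range(n_samples):
        # Hypothetical noise level standing in for a diffusion schedule.
        alpha_bar = rng.uniform(0.05, 0.95)
        eps = [rng.gauss(0.0, 1.0) for _ in x0]
        # Forward-noise the pair, then ask the predictor to recover the noise.
        x_t = [
            math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
            for x, e in zip(x0, eps)
        ]
        eps_hat = noise_predictor(x_t, alpha_bar)
        total += sum((eh - e) ** 2 for eh, e in zip(eps_hat, eps)) / len(x0)
    return total / n_samples


def combined_loss(policy, noise_predictor, state, expert_action, lam=1.0):
    """BC term on the action plus a weighted denoising term on (state, predicted action)."""
    pred = policy(state)
    return bc_loss(pred, expert_action) + lam * diffusion_loss(noise_predictor, state, pred)


if __name__ == "__main__":
    on_expert = combined_loss(lambda s: [1.0], toy_noise_predictor, EXPERT_STATE, EXPERT_ACTION)
    off_expert = combined_loss(lambda s: [0.0], toy_noise_predictor, EXPERT_STATE, EXPERT_ACTION)
    print("loss when the policy matches the expert:", on_expert)
    print("loss when the policy deviates:", off_expert)
```

In this toy setting both terms vanish (up to floating-point error) when the policy reproduces the expert action and both grow when it deviates. A real implementation would replace the analytic predictor with a learned network and would need the paper's actual loss definitions.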