Imitation learning addresses the challenge of learning from an expert's demonstrations without access to reward signals from the environment. Most existing imitation learning methods that do not require interacting with the environment model the expert distribution either as the conditional probability p(a|s) (e.g., behavioral cloning, BC) or as the joint probability p(s, a). BC is simple, but modeling only the conditional probability usually generalizes poorly. Modeling the joint probability can improve generalization, but the inference procedure is often time-consuming and the model can suffer from manifold overfitting. This work proposes an imitation learning framework that benefits from modeling both the conditional and the joint probability of the expert distribution. Our proposed diffusion model-augmented behavioral cloning (DBC) employs a diffusion model trained to model expert behaviors and learns a policy that optimizes both the BC loss (conditional) and our proposed diffusion model loss (joint). DBC outperforms baselines on a range of continuous control tasks in navigation, robot arm manipulation, dexterous manipulation, and locomotion. We design additional experiments to verify the limitations of modeling only the conditional probability or only the joint probability of the expert distribution, and to compare different generative models. Ablation studies justify the effectiveness of our design choices.
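To make the two-term objective concrete, the following is a minimal sketch of how a BC loss and a diffusion-based loss might be combined. The abstract does not give the exact form of the diffusion model loss, so this sketch assumes a standard DDPM-style noise-prediction error evaluated on (state, predicted action) pairs. The weighting coefficient `lam`, the noise schedule `alpha_bars`, and the toy `policy` and `noise_predictor` callables are all hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

def bc_loss(pred_actions, expert_actions):
    """Conditional term: mean squared error between policy and expert actions."""
    return float(np.mean((pred_actions - expert_actions) ** 2))

def diffusion_loss(noise_predictor, states, actions, alpha_bars, rng):
    """Joint term (assumed form): noise-prediction error on noised (s, a) pairs."""
    x0 = np.concatenate([states, actions], axis=1)          # joint sample (s, a)
    t = rng.integers(0, len(alpha_bars), size=x0.shape[0])  # random diffusion step per sample
    eps = rng.standard_normal(x0.shape)                     # injected Gaussian noise
    a_bar = alpha_bars[t][:, None]
    xt = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps   # forward-noised sample
    return float(np.mean((noise_predictor(xt, t) - eps) ** 2))

def combined_loss(policy, noise_predictor, states, expert_actions,
                  alpha_bars, rng, lam=1.0):
    """BC loss plus a weighted diffusion loss on the policy's own actions."""
    pred = policy(states)
    return (bc_loss(pred, expert_actions)
            + lam * diffusion_loss(noise_predictor, states, pred, alpha_bars, rng))

# Toy usage with hypothetical stand-ins for the learned models.
rng = np.random.default_rng(0)
states = rng.standard_normal((8, 3))
W = rng.standard_normal((3, 2))
expert_actions = states @ W                                  # pretend expert is linear
policy = lambda s: s @ (W + 0.1)                             # slightly wrong policy
noise_predictor = lambda xt, t: np.zeros_like(xt)            # placeholder for a pretrained network
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 100))  # linear beta schedule

total = combined_loss(policy, noise_predictor, states, expert_actions, alpha_bars, rng)
```

In an actual training setup, `noise_predictor` would be a diffusion model pretrained on expert (s, a) pairs and kept fixed, and the gradient of the combined loss would flow only into the policy; the placeholder here exists only so the sketch runs end to end.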