Adversarial imitation learning (AIL) is a popular method that has recently achieved considerable success. However, its performance remains unsatisfactory on more challenging tasks. We find that one major reason is the low quality of the AIL discriminator's representation. Because the AIL discriminator is trained via binary classification, which does not necessarily discriminate the policy from the expert in a meaningful way, the resulting reward might not be meaningful either. To resolve this issue, we propose a new method called Policy Contrastive Imitation Learning (PCIL). PCIL learns a contrastive representation space by anchoring on different policies and generates a smooth cosine-similarity-based reward. Our proposed representation learning objective can be viewed as a stronger version of the AIL objective and provides a more meaningful comparison between the agent and the expert. From a theoretical perspective, we show the validity of our method using the apprenticeship learning framework. Furthermore, our empirical evaluation on the DeepMind Control suite demonstrates that PCIL can achieve state-of-the-art performance. Finally, qualitative results suggest that PCIL builds a smoother and more meaningful representation space for imitation learning.
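The abstract's two core ingredients, a contrastive representation anchored on a policy and a cosine-similarity-based reward, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the InfoNCE-style form of the loss, the `temperature` value, and the choice to average similarity over all expert embeddings are assumptions made only for illustration, and the embeddings are taken as given rather than produced by a trained encoder.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (almost) unit length so dot products act as cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def cosine_reward(query_emb, expert_emb):
    """Hypothetical reward: mean cosine similarity of each query to all expert embeddings."""
    q = l2_normalize(query_emb)    # (N, D)
    e = l2_normalize(expert_emb)   # (M, D)
    return (q @ e.T).mean(axis=1)  # (N,), each value lies in [-1, 1]

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not taken from the paper).

    Each anchor row is pulled toward its paired positive row and pushed
    away from every negative row.
    """
    a = l2_normalize(anchor)      # (B, D)
    p = l2_normalize(positive)    # (B, D)
    n = l2_normalize(negatives)   # (K, D)
    pos_logit = np.sum(a * p, axis=1, keepdims=True) / temperature  # (B, 1)
    neg_logits = (a @ n.T) / temperature                            # (B, K)
    logits = np.concatenate([pos_logit, neg_logits], axis=1)        # (B, 1 + K)
    # Subtract the row maximum for numerical stability before exponentiating.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return -log_prob_pos.mean()
```

In this sketch, expert embeddings would serve as anchors and positives, and agent embeddings as negatives. Once the encoder separates the two groups, `cosine_reward` assigns higher values to agent states whose embeddings lie closer to the expert's, and it varies continuously with the embedding rather than saturating like a classifier output.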