Latent Wasserstein Adversarial Imitation Learning

Imitation Learning (IL) enables agents to mimic expert behavior by learning from demonstrations. However, traditional IL methods require large amounts of medium-to-high-quality demonstrations as well as actions of expert demonstrations, both of which are often unavailable. To reduce this need, we propose Latent Wasserstein Adversarial Imitation Learning (LWAIL), a novel adversarial imitation learning framework that focuses on state-only distribution matching. It benefits from the Wasserstein distance computed in a dynamics-aware latent space. This dynamics-aware latent space differs from prior work and is obtained via a pre-training stage, where we train the Intention Conditioned Value Function (ICVF) to capture a dynamics-aware structure of the state space using a small set of randomly generated state-only data. We show that this enhances the policy's understanding of state transitions, enabling the learning process to use only one or a few state-only expert episodes to achieve expert-level performance. Through experiments on multiple MuJoCo environments, we demonstrate that our method outperforms prior Wasserstein-based IL methods and prior adversarial IL methods, achieving better results across various tasks.
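To make the idea of state-only distribution matching in a latent space concrete, here is a minimal sketch, not the authors' implementation. It assumes a hypothetical scalar embedding `toy_phi` in place of the ICVF-learned latent map. It also replaces the adversarial critic with the closed-form Wasserstein-1 distance, which is available for equal-size samples on the real line. All function names are illustrative.

```python
from typing import Callable, Sequence

State = Sequence[float]


def wasserstein1_1d(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Wasserstein-1 distance between two equal-size empirical samples on the real line.

    In 1-D the optimal transport plan pairs the sorted samples, so the distance
    is the mean absolute difference between order statistics.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def latent_w1(
    expert_states: Sequence[State],
    agent_states: Sequence[State],
    phi: Callable[[State], float],
) -> float:
    """Compare expert and agent state distributions after embedding with phi."""
    return wasserstein1_1d(
        [phi(s) for s in expert_states],
        [phi(s) for s in agent_states],
    )


def toy_phi(state: State) -> float:
    """Hypothetical embedding (an assumption): keep only the first coordinate.

    It stands in for a learned, dynamics-aware latent map and pretends the
    second coordinate is irrelevant to the task.
    """
    return state[0]


# State-only data: no actions are needed on either side.
expert = [(0.0, 5.0), (1.0, -3.0), (2.0, 9.0)]
agent_close = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # differs only in the ignored coordinate
agent_far = [(3.0, 5.0), (4.0, -3.0), (5.0, 9.0)]   # shifted in the coordinate phi keeps

print(latent_w1(expert, agent_close, toy_phi))  # 0.0
print(latent_w1(expert, agent_far, toy_phi))    # 3.0
```

The choice of embedding determines which state differences the distance penalizes. In this toy case, `agent_close` is far from the expert in raw Euclidean terms but identical in the latent space. This is the kind of effect a dynamics-aware embedding is meant to provide.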

Related content

Imitation learning

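Imitation learning is a family of tasks in which a learner tries to mimic expert behavior in order to achieve the best possible performance. Mainstream methods currently include supervised imitation learning, stochastic mixing iterative learning, and dataset aggregation. The principle behind imitation learning is to implicitly give the learner prior information about the world, for example by having it perform and learn human behavior. In an imitation learning task, the agent looks for the best way to use a training set of input-output pairs demonstrated by the expert, so that it learns a policy that behaves as much like the human expert as possible. Although imitation learning is needed when an agent learns human behavior, simulating behavior in real time is very costly. By contrast, apprenticeship learning, proposed by Andrew Ng, follows a purely greedy (exploitative) policy and uses reinforcement learning to traverse all state-action trajectories in order to learn a near-optimal policy. It requires extremely difficult maneuvers, and recovering from unobserved states is nearly impossible. Imitation learning can handle such unexplored states, so it offers a more reliable general framework for many tasks such as autonomous driving.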
