Imitation Learning from Observation (ILfO) is a setting in which a learner tries to imitate the behavior of an expert using only observational data, without the direct guidance of demonstrated actions. In this paper, we re-examine the use of optimal transport for imitation learning (IL), in which a reward is generated from the Wasserstein distance between the state trajectories of the learner and the expert. We show that existing methods can be simplified to generate a reward function without requiring learned models or adversarial learning. Unlike many other state-of-the-art methods, our approach can be integrated with any reinforcement learning (RL) algorithm and is amenable to ILfO. We demonstrate the effectiveness of this simple approach on a variety of continuous control tasks and find that it surpasses the state of the art in the ILfO setting, achieving expert-level performance across a range of evaluation domains even when observing only a single expert trajectory without actions.
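To make the reward construction concrete, the following is a minimal sketch of how a per-step reward could be derived from an optimal transport plan between learner and expert state trajectories. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the function names (`squared_distance`, `sinkhorn_plan`, `ot_rewards`), the squared Euclidean cost, the uniform weights over time steps, and the parameters `eps` and `n_iters` are all hypothetical choices. The plan is approximated with entropy-regularized Sinkhorn iterations rather than an exact Wasserstein solver, so the result is only an approximation of the transport cost.

```python
import math


def squared_distance(x, y):
    """Squared Euclidean distance between two state vectors (assumed cost)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def sinkhorn_plan(cost, eps=0.05, n_iters=200):
    """Approximate OT plan between uniform marginals via Sinkhorn iterations."""
    n, m = len(cost), len(cost[0])
    # Rescale costs to [0, 1] so exp(-cost / eps) stays away from underflow.
    scale = max(max(row) for row in cost) or 1.0
    kernel = [[math.exp(-c / (scale * eps)) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns toward the uniform marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_rewards(learner_states, expert_states, eps=0.05, n_iters=200):
    """Per-step reward: negative transport cost attributed to each learner state."""
    cost = [[squared_distance(s, e) for e in expert_states]
            for s in learner_states]
    plan = sinkhorn_plan(cost, eps, n_iters)
    return [-sum(p * c for p, c in zip(plan_row, cost_row))
            for plan_row, cost_row in zip(plan, cost)]


# Example: states only, no actions, and a single expert trajectory.
expert = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
learner = [[0.0, 0.1], [1.0, 0.1], [2.0, 0.1]]
rewards = ot_rewards(learner, expert)  # one reward per learner time step
```

Because the rewards depend only on states and are computed without any learned component, they can be attached to the learner's transitions and passed to an off-the-shelf RL algorithm; the sum of the per-step rewards equals the negative of the approximate transport cost between the two trajectories.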