Imitation from observation (IfO) is a learning paradigm that consists of training autonomous agents in a Markov Decision Process (MDP) by observing expert demonstrations without access to its actions. These demonstrations could be sequences of environment states or raw visual observations of the environment. Recent work in IfO has focused on this problem in the case of observations of low-dimensional environment states, however, access to these highly-specific observations is unlikely in practice. In this paper, we adopt a challenging, but more realistic problem formulation, learning control policies that operate on a learned latent space with access only to visual demonstrations of an expert completing a task. We present BootIfOL, an IfO algorithm that aims to learn a reward function that takes an agent trajectory and compares it to an expert, providing rewards based on similarity to agent behavior and implicit goal. We consider this reward function to be a distance metric between trajectories of agent behavior and learn it via contrastive learning. The contrastive learning objective aims to closely represent expert trajectories and to distance them from non-expert trajectories. The set of non-expert trajectories used in contrastive learning is made progressively more complex by bootstrapping from roll-outs of the agent learned through RL using the current reward function. We evaluate our approach on a variety of control tasks showing that we can train effective policies using a limited number of demonstrative trajectories, greatly improving on prior approaches that consider raw observations.
翻译:从观察中模仿(IfO)是一种学习范式,通过观察专家演示(不获取其动作)在马尔可夫决策过程(MDP)中训练自主智能体。这些演示可以是环境状态的序列,也可以是环境的原始视觉观测。近年来,IfO 研究主要聚焦于低维环境状态观测下的问题,然而,实际场景中往往难以获取此类高度特化的观测。本文采用更具挑战性但更贴近现实的设定:仅通过专家完成任务的视觉演示,学习在潜表征空间上运行的控制策略。我们提出 BootIfOL 算法,该算法旨在学习一种奖励函数,通过比较智能体轨迹与专家轨迹,基于与智能体行为及隐式目标的相似性提供奖励。我们将此奖励函数视为智能体行为轨迹间的距离度量,并通过对比学习进行学习。对比学习目标旨在紧密表征专家轨迹,同时使其与非专家轨迹保持距离。对比学习中使用的非专家轨迹集通过自举方式逐步复杂化——利用当前奖励函数经强化学习得到的智能体展开轨迹构建。我们在多种控制任务上评估该方法,结果表明,使用有限数量的演示轨迹即可训练出高效策略,显著优于先前考虑原始观测的方法。