Agents that can learn to imitate from video observations, \emph{without direct access to state or action information}, are more applicable to learning in the natural world. However, formulating a reinforcement learning (RL) agent that achieves this goal remains a significant challenge. We approach this challenge using contrastive training to learn a reward function that compares an agent's behaviour with a single demonstration. We use a Siamese recurrent neural network architecture to learn rewards from a distance in space and time between motion clips while training an RL policy to minimize this distance. Through experimentation, we also find that including multi-task data and additional image encoding losses improves the temporal consistency of the learned rewards and, as a result, significantly improves policy learning. We demonstrate our approach on simulated humanoid, dog, and raptor agents in 2D and on a quadruped and a humanoid in 3D. We show that our method outperforms current state-of-the-art techniques in these environments and can learn to imitate from a single video demonstration.
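To make the idea of a learned distance-based reward concrete, the following is a minimal sketch and not the implementation used in this work. It assumes a plain tanh recurrent encoder with shared (Siamese) weights, a Euclidean distance between clip embeddings, a standard margin-based contrastive loss, and a reward defined as the negated distance. All names and sizes (`encode`, `distance`, `contrastive_loss`, `reward`, `FRAME_DIM`, `HIDDEN_DIM`) are hypothetical, the weights are random rather than trained, and the image encoder and gradient updates are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: per-frame feature dimension and embedding dimension.
FRAME_DIM, HIDDEN_DIM = 8, 4

# One set of weights shared by both branches; the sharing is what makes it Siamese.
# Assumption: weights are random here, in practice they would be trained.
W_in = rng.normal(scale=0.3, size=(HIDDEN_DIM, FRAME_DIM))
W_rec = rng.normal(scale=0.3, size=(HIDDEN_DIM, HIDDEN_DIM))


def encode(clip):
    """Run a plain tanh RNN over a clip of shape (T, FRAME_DIM); return the final hidden state."""
    h = np.zeros(HIDDEN_DIM)
    for frame in clip:
        h = np.tanh(W_in @ frame + W_rec @ h)
    return h


def distance(clip_a, clip_b):
    """Euclidean distance between the embeddings of two clips."""
    return float(np.linalg.norm(encode(clip_a) - encode(clip_b)))


def contrastive_loss(d, same, margin=1.0):
    """Margin-based contrastive loss: pull matching pairs together, push others past the margin."""
    return d ** 2 if same else max(0.0, margin - d) ** 2


def reward(agent_clip, demo_clip):
    """Assumed reward: the negated distance, so maximizing reward minimizes the distance."""
    return -distance(agent_clip, demo_clip)


# Stand-in clips of 10 frames each (random features in place of encoded images).
demo = rng.normal(size=(10, FRAME_DIM))
agent = rng.normal(size=(10, FRAME_DIM))
print(reward(demo, demo), reward(agent, demo))
```

In this sketch an agent clip identical to the demonstration receives the maximum reward of zero, and any other clip receives a non-positive reward. An RL policy trained against `reward` would therefore be pushed toward behaviour whose embedding matches that of the demonstration.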