Humans often acquire new skills through observation and imitation. For robotic agents, learning from the abundance of unlabeled video demonstration data available on the Internet requires imitating the expert without access to its actions, a problem known as Imitation Learning from Observations (ILfO). A common approach to tackling ILfO problems is to convert them into inverse reinforcement learning problems, using a proxy reward computed from the agent's and the expert's observations. Nonetheless, we identify that tasks characterized by a progress dependency property pose significant challenges for such approaches; in these tasks, the agent must first learn the expert's earlier behaviors before mastering the subsequent ones. Our investigation reveals that the main cause is that the reward signals assigned to later steps hinder the learning of the initial behaviors. To address this challenge, we present a novel ILfO framework that enables the agent to master earlier behaviors before advancing to later ones. We introduce an Automatic Discount Scheduling (ADS) mechanism that adaptively alters the discount factor in reinforcement learning during training, prioritizing earlier rewards at first and gradually engaging later rewards only once the earlier behaviors have been mastered. Our experiments on nine Meta-World tasks demonstrate that our method significantly outperforms state-of-the-art methods across all tasks, including those that they cannot solve.
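To make the ADS idea concrete, below is a minimal Python sketch of a discount schedule driven by a progress estimate. Both the `estimate_progress` recognizer and the linear `scheduled_discount` mapping are illustrative assumptions introduced here for exposition, not the paper's exact formulation.

```python
import numpy as np


def scheduled_discount(progress: float,
                       gamma_min: float = 0.9,
                       gamma_max: float = 0.99) -> float:
    """Map estimated progress in [0, 1] to a discount factor.

    A small discount focuses the agent on near-term (earlier) rewards;
    as progress approaches 1, the discount grows so that rewards from
    later steps start to influence learning. The linear mapping is an
    assumption; any monotone schedule would fit the same idea.
    """
    progress = min(max(progress, 0.0), 1.0)  # clamp to [0, 1]
    return gamma_min + (gamma_max - gamma_min) * progress


def estimate_progress(agent_obs: np.ndarray,
                      expert_obs: np.ndarray,
                      tol: float = 0.1) -> float:
    """Hypothetical progress recognizer: the fraction of consecutive
    expert observations, counted from the start of the demonstration,
    that the agent's trajectory reproduces to within distance `tol`."""
    matched = 0
    for e in expert_obs:
        # distance from every agent observation to this expert observation
        dists = np.linalg.norm(agent_obs - e, axis=1)
        if dists.min() < tol:
            matched += 1
        else:
            break  # progress stops at the first unmatched expert step
    return matched / len(expert_obs)


# Example: the agent reproduces the first half of a 4-step demonstration,
# so the scheduled discount sits halfway between gamma_min and gamma_max.
expert_obs = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
agent_obs = np.array([[0.05, 0.0], [0.95, 0.02]])
gamma = scheduled_discount(estimate_progress(agent_obs, expert_obs))
print(gamma)  # 0.945
```

The design intuition matches the claim above: a small discount early in training makes rewards at later steps nearly invisible, so they cannot interfere with learning the initial behaviors, and the discount grows only once those behaviors are reliably reproduced.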