Adversarial imitation learning (AIL) has become a popular alternative to supervised imitation learning that reduces the distribution shift suffered by the latter. However, AIL requires effective exploration during an online reinforcement learning phase. In this work, we show that the standard, naive approach to exploration can manifest as a suboptimal local maximum if a policy learned with AIL sufficiently matches the expert distribution without fully learning the desired task. This can be particularly catastrophic for manipulation tasks, where the difference between an expert and a non-expert state-action pair is often subtle. We present Learning from Guided Play (LfGP), a framework in which we leverage expert demonstrations of multiple exploratory, auxiliary tasks in addition to a main task. The addition of these auxiliary tasks forces the agent to explore states and actions that standard AIL may learn to ignore. This formulation also allows expert data to be reused between main tasks. Our experimental results in a challenging multitask robotic manipulation domain indicate that LfGP significantly outperforms both AIL and behaviour cloning, while also being more expert-sample-efficient than these baselines. To explain this performance gap, we provide further analysis of a toy problem that highlights the coupling between a local maximum and poor exploration, and we also visualize the differences between the models learned with AIL and LfGP.