While combining imitation learning (IL) and reinforcement learning (RL) is a promising way to address poor sample efficiency in autonomous behavior acquisition, methods that do so typically assume that the requisite behavior demonstrations are provided by an expert that behaves optimally with respect to a task reward. If, however, suboptimal demonstrations are provided, a fundamental challenge appears in that the demonstration-matching objective of IL conflicts with the return-maximization objective of RL. This paper introduces D-Shape, a new method for combining IL and RL that uses ideas from reward shaping and goal-conditioned RL to resolve the above conflict. D-Shape allows learning from suboptimal demonstrations while retaining the ability to find the optimal policy with respect to the task reward. We experimentally validate D-Shape in sparse-reward gridworld domains, showing that it both improves over RL in terms of sample efficiency and converges consistently to the optimal policy in the presence of suboptimal demonstrations.