Recently, graph-based planning algorithms have gained much attention for solving goal-conditioned reinforcement learning (RL) tasks: they provide a sequence of subgoals to reach the target-goal, and the agents learn to execute subgoal-conditioned policies. However, the sample efficiency of such RL schemes remains a challenge, particularly for long-horizon tasks. To address this issue, we present a simple yet effective self-imitation scheme that distills a subgoal-conditioned policy into the target-goal-conditioned policy. Our intuition is that, to reach a target-goal, an agent should pass through a subgoal, so the target-goal-conditioned and subgoal-conditioned policies should be similar to each other. We also propose a novel scheme of stochastically skipping executed subgoals in a planned path, which further improves performance. Unlike prior methods that utilize graph-based planning only in the execution phase, our method transfers knowledge from a planner, along with a graph, into policy learning. We empirically show that our method can significantly boost the sample efficiency of existing goal-conditioned RL methods on various long-horizon control tasks.
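The two ideas in the abstract, self-imitation between goal-conditioned policies and stochastic subgoal skipping, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the tiny linear `policy`, the squared-error form of `self_imitation_loss`, and the particular skipping rule in `skip_subgoals` (advance one subgoal, then keep jumping ahead with probability `skip_prob`, never past the target-goal).

```python
import math
import random


def policy(weights, state, goal):
    """Hypothetical deterministic policy: tanh of a linear map of [state; goal]."""
    inputs = list(state) + list(goal)
    return [math.tanh(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def self_imitation_loss(weights, state, target_goal, subgoal):
    """Squared distance between target-goal- and subgoal-conditioned actions.

    Assumption: the subgoal-conditioned action acts as a fixed teacher target;
    a real trainer would stop gradients through it.
    """
    teacher = policy(weights, state, subgoal)
    student = policy(weights, state, target_goal)
    return sum((s - t) ** 2 for s, t in zip(student, teacher))


def skip_subgoals(path, index, skip_prob, rng):
    """Return the index of the next subgoal to pursue after reaching path[index].

    Assumption: move to the next subgoal, then keep skipping ahead with
    probability skip_prob; the last entry (the target-goal) is never skipped.
    """
    last = len(path) - 1
    index = min(index + 1, last)
    while index < last and rng.random() < skip_prob:
        index += 1
    return index


if __name__ == "__main__":
    weights = [[0.1, -0.2, 0.3, 0.4], [0.5, 0.1, -0.3, 0.2]]
    state, subgoal, target_goal = [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]
    print("loss:", self_imitation_loss(weights, state, target_goal, subgoal))

    path = ["sg1", "sg2", "sg3", "target"]
    print("next index:", skip_subgoals(path, 0, 0.5, random.Random(0)))
```

Minimizing a loss of this kind pulls the target-goal-conditioned action toward the action the policy would take for an intermediate subgoal on the planned path, which is the similarity the abstract argues for. The skipping rule varies which subgoal the agent is asked to reach next.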