Applying reinforcement learning (RL) to sparse-reward domains is notoriously challenging because the reward provides too little signal to guide learning. Common RL techniques for addressing such domains include (1) learning from demonstrations and (2) curriculum learning. While these two approaches have each been studied in detail, they have rarely been considered together. This paper combines them by introducing a principled task phasing approach that uses demonstrations to automatically generate a curriculum sequence. Using inverse RL from (suboptimal) demonstrations, we define a simple initial task. Our task phasing approach then provides a framework for gradually increasing the complexity of the task until it matches the target task, while retuning the RL agent in each phasing iteration. Two approaches to phasing are considered: (1) gradually increasing the proportion of time steps in which the RL agent is in control, and (2) phasing out a guiding, informative reward function. We present conditions that guarantee the convergence of these approaches to an optimal policy. Experimental results on three sparse-reward domains demonstrate that our task phasing approaches outperform state-of-the-art approaches with respect to asymptotic performance.
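As a rough illustration of the two phasing approaches named in the abstract, the following Python sketch shows one way the phasing parameters could be scheduled. The linear schedule, the function names (`control_fraction`, `choose_action`, `phased_reward`), and the additive reward blend are all assumptions made for illustration. The abstract does not specify them, and they should not be read as the paper's actual algorithm.

```python
import random
from typing import Any, Callable

Policy = Callable[[Any], Any]  # maps a state to an action (hypothetical type alias)


def control_fraction(phase: int, num_phases: int) -> float:
    """Share of time steps given to the RL agent at a phase in 0..num_phases.

    Assumption: a linear schedule from 0.0 (demonstrator acts on every step)
    to 1.0 (target task, the agent acts on every step).
    """
    if num_phases <= 0:
        raise ValueError("num_phases must be positive")
    if not 0 <= phase <= num_phases:
        raise ValueError("phase must lie in [0, num_phases]")
    return phase / num_phases


def choose_action(state: Any, learner: Policy, demonstrator: Policy,
                  beta: float, rng: random.Random) -> Any:
    """Approach (1): the learner acts with probability beta, otherwise the
    demonstration-derived policy acts."""
    return learner(state) if rng.random() < beta else demonstrator(state)


def phased_reward(sparse_reward: float, guide_reward: float, alpha: float) -> float:
    """Approach (2): add a guiding reward scaled by alpha to the sparse reward.

    alpha = 1.0 keeps the full guiding signal; alpha = 0.0 leaves only the
    sparse target reward.
    """
    return sparse_reward + alpha * guide_reward


if __name__ == "__main__":
    num_phases = 4  # hypothetical number of phasing iterations
    for phase in range(num_phases + 1):
        beta = control_fraction(phase, num_phases)
        alpha = 1.0 - beta  # guiding reward fades as agent control grows
        # The RL agent would be retuned here before moving to the next phase.
        print(f"phase={phase} beta={beta:.2f} alpha={alpha:.2f}")
```

In this sketch, both approaches reach the target task at the final phase: `beta` equals 1 and `alpha` equals 0, so the agent is fully in control and receives only the sparse reward.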