This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated many steps. In this paper, we avoid this limitation by constraining background planning to a set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning and avoids learning the transition dynamics entirely. We show that our GSP algorithm can learn significantly faster than a Double DQN baseline in a variety of situations.
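For readers unfamiliar with background planning, the Dyna-style mixing of model-free and model-based updates mentioned above can be illustrated with a minimal tabular Dyna-Q sketch. This is not the GSP algorithm: it learns a one-step transition model, which is exactly what GSP is designed to avoid. The corridor environment `chain_step`, the function name `dyna_q`, and all hyperparameter values are hypothetical choices made only for illustration.

```python
import random
from collections import defaultdict


def chain_step(state, action):
    """Hypothetical 6-state corridor: action is -1 (left) or +1 (right).

    Reaching state 5 ends the episode with reward 1; all other rewards are 0.
    """
    nxt = min(max(state + action, 0), 5)
    done = nxt == 5
    return (1.0 if done else 0.0), nxt, done


def dyna_q(step_fn, start_state, actions, episodes=50, planning_steps=10,
           alpha=0.5, gamma=0.95, epsilon=0.1, seed=0, max_steps=1000):
    """Tabular Dyna-Q: each real step is followed by simulated updates."""
    rng = random.Random(seed)
    q = defaultdict(float)  # (state, action) -> estimated action value
    model = {}              # (state, action) -> (reward, next_state, done)

    def greedy(s):
        # Break ties at random so early exploration is not biased.
        best = max(q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if q[(s, a)] == best])

    def update(s, a, r, s2, done):
        # One-step Q-learning target; no bootstrapping from terminal states.
        target = r if done else r + gamma * max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s = start_state
        for _ in range(max_steps):
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s2, done = step_fn(s, a)
            update(s, a, r, s2, done)      # model-free update from real data
            model[(s, a)] = (r, s2, done)  # assumes deterministic dynamics
            for _ in range(planning_steps):
                # Background planning: replay transitions from the model.
                ps, pa = rng.choice(list(model))
                update(ps, pa, *model[(ps, pa)])
            s = s2
            if done:
                break
    return q


q_values = dyna_q(chain_step, start_state=0, actions=[-1, 1])
```

In this sketch the planning loop reuses the same update rule on remembered transitions, so the extra computation per real step is controlled by `planning_steps`. The model here stores observed transitions exactly; with a learned, approximate model the simulated next states can be invalid, which is the failure mode that motivates planning over subgoals instead.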