One of the key behavioral characteristics used in neuroscience to determine whether the subject of study (be it a rodent or a human) exhibits model-based learning is effective adaptation to local changes in the environment. In reinforcement learning, however, recent work has shown that modern deep model-based reinforcement-learning (MBRL) methods adapt poorly to such changes. One explanation for this mismatch is that MBRL methods are typically designed with sample efficiency on a single task in mind, whereas effective adaptation places substantially higher requirements on both the learned world model and the planning routine. One particularly challenging requirement is that the learned world model has to be sufficiently accurate throughout the relevant parts of the state space. Deep-learning-based world models struggle to meet this requirement because of catastrophic forgetting. A replay buffer can mitigate catastrophic forgetting, but the traditional first-in-first-out replay buffer precludes effective adaptation because it retains stale data. In this work, we show that a conceptually simple variation of this traditional replay buffer is able to overcome this limitation. By removing from the buffer only those samples that lie in the local neighbourhood of newly observed samples, deep world models can be built that maintain their accuracy across the state space while also adapting effectively to changes in the reward function. We demonstrate this by applying our replay-buffer variation to a deep version of the classical Dyna method, as well as to recent methods such as PlaNet and DreamerV2, showing that deep model-based methods, too, can adapt effectively to local changes in the environment.
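The buffer variation described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the class name `LocalForgettingBuffer`, the fixed `radius`, and the use of Euclidean distance on raw state vectors to define the "local neighbourhood" are all assumptions made for illustration. The only idea taken from the text is that a new sample evicts stored samples near it, rather than the oldest sample in the buffer.

```python
import numpy as np


class LocalForgettingBuffer:
    """Hypothetical sketch: a new sample evicts stored samples close to it."""

    def __init__(self, radius):
        # Assumption: neighbourhood = Euclidean ball of this radius in state space.
        self.radius = radius
        self.states = []  # stored states, each a 1-D float array
        self.data = []    # matching (action, reward, next_state) tuples

    def add(self, state, action, reward, next_state):
        state = np.asarray(state, dtype=float)
        if self.states:
            # Distance from the new state to every stored state.
            dists = np.linalg.norm(np.stack(self.states) - state, axis=1)
            keep = dists > self.radius
            # Drop only the stale neighbours; distant samples are retained.
            self.states = [s for s, k in zip(self.states, keep) if k]
            self.data = [d for d, k in zip(self.data, keep) if k]
        self.states.append(state)
        self.data.append((action, reward, np.asarray(next_state, dtype=float)))

    def sample(self, batch_size, rng):
        # Uniform sampling with replacement over the stored transitions.
        idx = rng.integers(len(self.states), size=batch_size)
        return [(self.states[i], *self.data[i]) for i in idx]

    def __len__(self):
        return len(self.states)
```

In this sketch, a transition observed at a state whose reward has changed replaces the old transitions near that state, while transitions from unvisited regions stay in the buffer and continue to support the world model's accuracy there. A first-in-first-out buffer would instead keep the stale neighbours until they age out.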