Replay Buffer with Local Forgetting for Adapting to Local Environment Changes in Deep Model-Based Reinforcement Learning

In neuroscience, one of the key behavioral characteristics used to determine whether a subject, be it a rodent or a human, exhibits model-based learning is effective adaptation to local changes in the environment. This particular form of adaptivity is the focus of this work. In reinforcement learning, however, recent work has shown that modern deep model-based reinforcement-learning (MBRL) methods adapt poorly to such changes. One explanation for this mismatch is that MBRL methods are typically designed with sample efficiency on a single task in mind, whereas the requirements for effective adaptation are substantially higher, both for the learned world model and for the planning routine. One particularly challenging requirement is that the learned world model must remain sufficiently accurate throughout the relevant parts of the state space. This is difficult for deep-learning-based world models because of catastrophic forgetting. While a replay buffer can mitigate catastrophic forgetting, the traditional first-in-first-out replay buffer precludes effective adaptation because it retains stale data. In this work, we show that a conceptually simple variation of the traditional replay buffer overcomes this limitation. By removing from the buffer only those samples that lie in the local neighbourhood of newly observed samples, deep world models can be built that maintain their accuracy across the state space while also adapting effectively to local changes in the reward function. We demonstrate this by applying our replay-buffer variation to a deep version of the classical Dyna method, as well as to recent methods such as PlaNet and DreamerV2, showing that deep model-based methods can also adapt effectively to local changes in the environment.
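The core idea of the buffer, evicting only stored samples that lie close to a newly observed one, can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `LocalForgettingBuffer`, the fixed `radius`, the use of Euclidean distance on raw state vectors, and the first-in-first-out fallback when the capacity is exceeded are all assumptions made for illustration. The abstract does not specify how the local neighbourhood is defined.

```python
import math
import random
from typing import List, Sequence, Tuple

# (state, action, reward, next_state); this layout is an assumption of the sketch.
Transition = Tuple[Sequence[float], int, float, Sequence[float]]


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between two equally sized state vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


class LocalForgettingBuffer:
    """Hypothetical replay buffer that forgets old samples near each new sample."""

    def __init__(self, capacity: int, radius: float) -> None:
        self.capacity = capacity  # hard size limit (assumed FIFO fallback)
        self.radius = radius      # assumed neighbourhood size in state space
        self.data: List[Transition] = []

    def add(self, transition: Transition) -> None:
        new_state = transition[0]
        # Local forgetting: keep only stored samples whose state lies
        # outside the neighbourhood of the newly observed state.
        self.data = [
            t for t in self.data if euclidean(t[0], new_state) > self.radius
        ]
        self.data.append(transition)
        # Fallback: if the buffer is still over capacity, drop the oldest sample.
        if len(self.data) > self.capacity:
            self.data.pop(0)

    def sample(self, batch_size: int) -> List[Transition]:
        """Uniformly sample up to batch_size stored transitions."""
        return random.sample(self.data, min(batch_size, len(self.data)))

    def __len__(self) -> int:
        return len(self.data)


# Usage: the reward near state (0, 0) changes from 1.0 to -1.0.
buf = LocalForgettingBuffer(capacity=100, radius=0.5)
buf.add(((0.0, 0.0), 0, 1.0, (0.1, 0.0)))   # old, soon-to-be-stale sample
buf.add(((5.0, 5.0), 1, 0.0, (5.1, 5.0)))   # far-away sample, unaffected
buf.add(((0.1, 0.0), 0, -1.0, (0.2, 0.0)))  # new sample evicts its stale neighbour
rewards = [t[2] for t in buf.data]
```

In this sketch the stale sample near the origin is removed when the new observation arrives, while the distant sample is retained, so a world model trained on the buffer would see only the updated local reward without losing coverage elsewhere. A linear scan over the buffer is used for simplicity; a practical variant would likely need a more efficient neighbourhood query.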
