In continual reinforcement learning (RL), the environment of an RL agent changes over time. A successful system must balance two conflicting requirements: retaining performance on tasks it has already learned (stability) while learning new tasks (plasticity). A first-in-first-out (FIFO) replay buffer is commonly used to enhance learning in such settings, but it requires significant memory. We explore an augmentation to this buffer that alleviates the memory constraints, and combine it with a model-based RL algorithm built on a world model to evaluate how effectively it facilitates continual learning. We evaluate our method on the Procgen and Atari RL benchmarks and show that a distribution-matching augmentation to the replay buffer, used with latent world models, can prevent catastrophic forgetting with significantly reduced computational overhead. However, we also find that this solution is not infallible: other failure modes, such as the opposite problem of lacking plasticity and being unable to learn a new task, remain a potential limitation of continual learning systems.
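The abstract does not say how the distribution-matching augmentation is implemented. One common way to make a fixed-size buffer approximate the distribution of *all* past experience is reservoir sampling. The sketch below assumes that technique and contrasts it with a FIFO buffer of the same capacity on a toy two-task stream. The class names, capacities, and the `(task, step)` transitions are hypothetical illustrations, not the authors' code.

```python
import random
from collections import deque


class FIFOBuffer:
    """Keeps only the most recent `capacity` transitions; old tasks are evicted first."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def add(self, item):
        self.data.append(item)  # deque drops the oldest item once full

    def sample(self, k):
        return random.sample(list(self.data), k)


class ReservoirBuffer:
    """Fixed-capacity buffer whose contents are a uniform random sample of
    every transition seen so far (reservoir sampling, Algorithm R).
    Assumed here as one way to realise "distribution matching"."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.data = []
        self.num_seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.num_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(item)
        else:
            # Keep the new item with probability capacity / num_seen,
            # overwriting a uniformly chosen slot.
            j = self.rng.randrange(self.num_seen)
            if j < self.capacity:
                self.data[j] = item

    def sample(self, k):
        return self.rng.sample(self.data, k)


# Toy stream: 10,000 transitions from hypothetical task "A", then 10,000 from task "B".
fifo = FIFOBuffer(capacity=1_000)
reservoir = ReservoirBuffer(capacity=1_000, seed=0)
for task in ("A", "B"):
    for step in range(10_000):
        fifo.add((task, step))
        reservoir.add((task, step))


def frac_task_a(buffer_contents):
    """Fraction of stored transitions that come from task A."""
    return sum(t == "A" for t, _ in buffer_contents) / len(buffer_contents)


print(frac_task_a(fifo.data))       # 0.0: task A has been fully evicted
print(frac_task_a(reservoir.data))  # roughly 0.5: task A is still represented
```

With the same memory budget, the FIFO buffer retains only the newest task, so replay cannot counteract forgetting of task A, whereas the reservoir buffer keeps a roughly proportional share of both tasks. A FIFO buffer would need to grow with the total length of the task sequence to achieve the same coverage, which illustrates the memory pressure the abstract refers to.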