Continual reinforcement learning challenges agents to acquire new skills while retaining previously learned ones, with the goal of improving performance on both past and future tasks. Most existing approaches rely on model-free methods with replay buffers to mitigate catastrophic forgetting; however, these solutions often scale poorly because of their large memory demands. Drawing inspiration from neuroscience, where the brain replays experiences to a predictive world model rather than directly to the policy, we present ARROW (Augmented Replay for RObust World models), a model-based continual RL algorithm that extends DreamerV3 with a memory-efficient, distribution-matching replay buffer. Unlike standard fixed-size FIFO buffers, ARROW maintains two complementary buffers: a short-term buffer for recent experiences and a long-term buffer that preserves task diversity through intelligent sampling. We evaluate ARROW in two challenging continual RL settings: tasks without shared structure (Atari), and tasks with shared structure, where knowledge transfer is possible (Procgen CoinRun variants). Compared to model-free and model-based baselines with replay buffers of the same size, ARROW demonstrates substantially less forgetting on tasks without shared structure, while maintaining comparable forward transfer. Our findings highlight the potential of model-based RL and bio-inspired approaches for continual reinforcement learning, warranting further research.
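To make the two-buffer idea concrete, the following is a minimal sketch, not the ARROW implementation. It assumes the short-term buffer is a plain FIFO queue and uses reservoir sampling as a stand-in for the "intelligent sampling" of the long-term buffer; the abstract does not specify the sampling rule, so that choice is an assumption. The class name `TwoBufferReplay`, its parameters, and the 50/50 batch split are all hypothetical.

```python
import random
from collections import deque


class TwoBufferReplay:
    """Illustrative two-buffer replay (hypothetical; not the ARROW code)."""

    def __init__(self, short_capacity, long_capacity, seed=0):
        self.short = deque(maxlen=short_capacity)  # FIFO: oldest items fall out
        self.long = []                             # uniform sample of the whole stream
        self.long_capacity = long_capacity
        self.seen = 0                              # number of items offered so far
        self.rng = random.Random(seed)

    def add(self, item):
        self.short.append(item)
        self.seen += 1
        if len(self.long) < self.long_capacity:
            self.long.append(item)
        else:
            # Reservoir sampling (Algorithm R), assumed here: the new item
            # replaces a random slot with probability long_capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.long_capacity:
                self.long[j] = item

    def sample(self, batch_size):
        if not self.short:
            raise ValueError("cannot sample from an empty buffer")
        # Assumed split: half the batch from each buffer, with replacement.
        half = batch_size // 2
        recent = self.rng.choices(list(self.short), k=batch_size - half)
        old = self.rng.choices(self.long, k=half)
        return recent + old


# Three sequential "tasks" of 1000 steps each, stored as (task_id, step) pairs.
buf = TwoBufferReplay(short_capacity=100, long_capacity=100)
for task_id in range(3):
    for step in range(1000):
        buf.add((task_id, step))

short_tasks = {task for task, _ in buf.short}
long_tasks = {task for task, _ in buf.long}
batch = buf.sample(32)
```

Under these assumptions, the short-term buffer ends up holding only the most recent task, while the long-term buffer keeps an approximately uniform sample over all three tasks within the same fixed memory budget, which is the property a distribution-matching buffer is meant to provide.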