Beyond Single-Model Optimization: Preserving Plasticity in Continual Reinforcement Learning

Continual reinforcement learning must balance retention with adaptation, yet many methods still rely on \emph{single-model preservation}: they commit to one evolving policy as the main reusable solution across tasks. Even when a previously successful policy is retained, it may no longer provide a reliable starting point for rapid adaptation after interference, a form of \emph{loss of plasticity} that single-policy preservation cannot address. Inspired by quality-diversity methods, we introduce \textsc{TeLAPA} (Transfer-Enabled Latent-Aligned Policy Archives), a continual RL framework that organizes behaviorally diverse policy neighborhoods into per-task archives and maintains a shared latent space so that archived policies remain comparable and reusable under non-stationary drift. This perspective shifts continual RL from retaining isolated solutions to maintaining \emph{skill-aligned neighborhoods} of competent, behaviorally related policies that support future relearning. In our MiniGrid continual-learning setting, \textsc{TeLAPA} learns more tasks successfully, recovers competence faster on revisited tasks after interference, and retains higher performance across a sequence of tasks. Our analyses show that source-optimal policies are often not transfer-optimal, even within a local competent neighborhood, and that effective reuse depends on retaining and selecting among multiple nearby alternatives rather than collapsing them to one representative. Together, these results reframe continual RL around reusable, competent policy neighborhoods, providing a route beyond single-model preservation toward more plastic lifelong agents.
