Beyond Single-Model Optimization: Preserving Plasticity in Continual Reinforcement Learning

Continual reinforcement learning must balance retention with adaptation, yet many methods still rely on \emph{single-model preservation}, committing to one evolving policy as the main reusable solution across tasks. Even when a previously successful policy is retained, it may no longer provide a reliable starting point for rapid adaptation after interference, reflecting a form of \emph{loss of plasticity} that single-policy preservation cannot address. Inspired by quality-diversity methods, we introduce \textsc{TeLAPA} (Transfer-Enabled Latent-Aligned Policy Archives), a continual RL framework that organizes behaviorally diverse policy neighborhoods into per-task archives and maintains a shared latent space so that archived policies remain comparable and reusable under non-stationary drift. This perspective shifts continual RL from retaining isolated solutions to maintaining \emph{skill-aligned neighborhoods} of competent, behaviorally related policies that support future relearning. In our MiniGrid continual-learning setting, \textsc{TeLAPA} successfully learns more tasks, recovers competence faster on revisited tasks after interference, and retains higher performance across a sequence of tasks. Our analyses show that source-optimal policies are often not transfer-optimal, even within a local competent neighborhood, and that effective reuse depends on retaining and selecting among multiple nearby alternatives rather than collapsing them to one representative. Together, these results reframe continual RL around reusable and competent policy neighborhoods, providing a route beyond single-model preservation toward more plastic lifelong agents.
