Training a generalizable agent to continually learn a sequence of tasks from offline trajectories is a natural requirement for long-lived agents, yet remains a significant challenge for current offline reinforcement learning (RL) algorithms. Specifically, an agent must be able to rapidly adapt to new tasks using newly collected trajectories (plasticity), while retaining knowledge from previously learned tasks (stability). However, systematic analyses of this setting are scarce, and it remains unclear whether conventional continual learning (CL) methods are effective in continual offline RL (CORL) scenarios. In this study, we develop the Offline Continual World benchmark and demonstrate that traditional CL methods struggle with catastrophic forgetting, primarily due to the unique distribution shifts inherent to CORL scenarios. To address this challenge, we introduce CompoFormer, a structure-based continual transformer model that adaptively composes previous policies via a meta-policy network. Upon encountering a new task, CompoFormer leverages semantic correlations to selectively integrate relevant prior policies alongside newly trained parameters, thereby enhancing knowledge sharing and accelerating the learning process. Our experiments reveal that CompoFormer outperforms conventional CL methods, particularly in longer task sequences, showcasing a promising balance between plasticity and stability.
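The abstract describes the composition mechanism only at a high level. Below is a minimal PyTorch sketch of one way such a meta-policy could work: the class name MetaPolicyComposer, the dot-product attention over task embeddings, and the additive composition of policy outputs are all illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaPolicyComposer(nn.Module):
    """Hypothetical sketch: compose frozen prior policy modules with a fresh
    module for the current task, weighting each prior module by semantic
    similarity between task-description embeddings."""

    def __init__(self, task_dim, state_dim, action_dim, hidden=128):
        super().__init__()
        self.query = nn.Linear(task_dim, hidden)  # embeds the new task description
        self.key = nn.Linear(task_dim, hidden)    # embeds prior task descriptions
        self.policies = nn.ModuleList()           # policies archived from earlier tasks
        self.task_embs = []                       # fixed embeddings (e.g., frozen text encoder)
        self.state_dim, self.action_dim = state_dim, action_dim
        self.new_policy = None

    def start_task(self, task_emb):
        # Freeze everything learned so far (stability), then add a trainable
        # module for the new task (plasticity).
        for p in self.policies.parameters():
            p.requires_grad_(False)
        self.new_policy = nn.Sequential(
            nn.Linear(self.state_dim, 256), nn.ReLU(),
            nn.Linear(256, self.action_dim))
        self.current_emb = task_emb

    def forward(self, state):
        out = self.new_policy(state)
        if len(self.policies) > 0:
            # Semantic correlations decide how much each prior policy contributes.
            q = self.query(self.current_emb)                         # (hidden,)
            ks = torch.stack([self.key(e) for e in self.task_embs])  # (T, hidden)
            attn = F.softmax(ks @ q / ks.shape[-1] ** 0.5, dim=0)    # (T,)
            prior = sum(w * pi(state) for w, pi in zip(attn, self.policies))
            out = out + prior  # compose new and selected prior policies
        return out

    def finish_task(self):
        # Archive the trained module and its task embedding for later reuse.
        self.policies.append(self.new_policy)
        self.task_embs.append(self.current_emb)
```

Under these assumptions, only the new module and the query/key projections receive gradients on each task, while archived policies stay frozen, which is one plausible way to realize the plasticity/stability trade-off the abstract describes.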