We study continual offline reinforcement learning, a practical paradigm that tackles sequential offline tasks by facilitating forward transfer and mitigating catastrophic forgetting. We propose a dual generative replay framework that retains previous knowledge through concurrent replay of generated pseudo-data. First, we decouple the continual learning policy into a diffusion-based generative behavior model and a multi-head action evaluation model, allowing the policy to inherit the distributional expressivity needed to encompass a progressively widening range of diverse behaviors. Second, we train a task-conditioned diffusion model to mimic the state distributions of past tasks; the generated states are paired with corresponding responses from the behavior generator, so that old tasks are represented by high-fidelity replayed samples. Finally, by interleaving pseudo samples with real ones from the new task, we continually update the state and behavior generators to model progressively diverse behaviors, and regularize the multi-head critic via behavior cloning to mitigate forgetting. Experiments demonstrate that our method achieves better forward transfer with less forgetting, and that its high-fidelity replay of the sample space lets it closely approximate the results obtained with previous ground-truth data. Our code is available at \href{https://github.com/NJU-RL/CuGRO}{https://github.com/NJU-RL/CuGRO}.
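To make the replay data flow concrete, the following is a minimal Python sketch under stated assumptions, not the paper's implementation: the diffusion state and behavior models are replaced by trivial stand-ins (the class names `StateGenerator` and `BehaviorGenerator` and the toy least-squares "training" are hypothetical), and only the interleaving of generated pseudo-samples with real new-task data mirrors the framework described above.

```python
import numpy as np

rng = np.random.default_rng(0)

class StateGenerator:
    """Hypothetical stand-in for the task-conditioned diffusion state generator."""
    def __init__(self, dim):
        self.dim = dim
        self.means = {}  # per-task state statistics (toy surrogate for a diffusion model)

    def fit(self, task_id, states):
        self.means[task_id] = states.mean(axis=0)

    def sample(self, task_id, n):
        # Pretend diffusion sampling: noise around the stored per-task mean.
        return self.means[task_id] + 0.1 * rng.standard_normal((n, self.dim))

class BehaviorGenerator:
    """Hypothetical stand-in for the diffusion behavior model pi(a | s, task)."""
    def __init__(self):
        self.W = None

    def fit(self, states, actions):
        # Least-squares a ~ s @ W as a toy surrogate for diffusion training.
        self.W, *_ = np.linalg.lstsq(states, actions, rcond=None)

    def act(self, states):
        return states @ self.W

def continual_update(state_gen, behav_gen, old_tasks, new_task_id, real_s, real_a):
    # 1) Dual generative replay: sample pseudo states for each past task and
    #    label them with the behavior generator's corresponding responses.
    pseudo_s = [state_gen.sample(t, len(real_s)) for t in old_tasks]
    pseudo_a = [behav_gen.act(s) for s in pseudo_s]
    # 2) Interleave pseudo and real samples, then update both generators.
    S = np.vstack(pseudo_s + [real_s])
    A = np.vstack(pseudo_a + [real_a])
    behav_gen.fit(S, A)
    state_gen.fit(new_task_id, real_s)
    return S, A
```

The critic regularization via behavior cloning is omitted here; the sketch only shows how replayed state-action pairs from old tasks are mixed with the new task's real data before each generator update.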