We study continual offline reinforcement learning, a practical paradigm that facilitates forward transfer and mitigates catastrophic forgetting to tackle sequential offline tasks. We propose a dual generative replay framework that retains previous knowledge by concurrent replay of generated pseudo-data. First, we decouple the continual learning policy into a diffusion-based generative behavior model and a multi-head action evaluation model, allowing the policy to inherit distributional expressivity for encompassing a progressive range of diverse behaviors. Second, we train a task-conditioned diffusion model to mimic state distributions of past tasks. Generated states are paired with corresponding responses from the behavior generator to represent old tasks with high-fidelity replayed samples. Finally, by interleaving pseudo samples with real ones of the new task, we continually update the state and behavior generators to model progressively diverse behaviors, and regularize the multi-head critic via behavior cloning to mitigate forgetting. Experiments demonstrate that our method achieves better forward transfer with less forgetting, and closely approximates the results of using previous ground-truth data due to its high-fidelity replay of the sample space. Our code is available at \href{https://github.com/NJU-RL/CuGRO}{https://github.com/NJU-RL/CuGRO}.
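The interleaving step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the two diffusion models are replaced by toy callables, and the names `make_replay_batch`, `toy_state_generator`, `toy_behavior_generator`, and the `replay_fraction` parameter are hypothetical, since the abstract does not state a mixing ratio. The sketch only shows how generated states of past tasks might be paired with actions from a behavior generator and mixed with real samples of the new task; the behavior-cloning regularization of the multi-head critic is not covered.

```python
import random
from typing import Callable, List, Sequence, Tuple

State = List[float]
Action = List[float]
Sample = Tuple[int, State, Action]  # (task_id, state, action)


def make_replay_batch(
    real_samples: Sequence[Sample],
    past_task_ids: Sequence[int],
    state_generator: Callable[[int, random.Random], State],
    behavior_generator: Callable[[State, random.Random], Action],
    batch_size: int,
    replay_fraction: float,
    rng: random.Random,
) -> List[Sample]:
    """Mix real new-task samples with generated pseudo-samples of past tasks."""
    # Assumption: a fixed fraction of each batch is replayed pseudo-data.
    n_pseudo = int(round(batch_size * replay_fraction)) if past_task_ids else 0
    n_real = batch_size - n_pseudo

    # Real samples come from the offline dataset of the new task.
    batch = [rng.choice(real_samples) for _ in range(n_real)]

    # Pseudo-samples: a task-conditioned state generator proposes a state,
    # and the behavior generator supplies the corresponding action.
    for _ in range(n_pseudo):
        task_id = rng.choice(past_task_ids)
        state = state_generator(task_id, rng)
        action = behavior_generator(state, rng)
        batch.append((task_id, state, action))

    rng.shuffle(batch)  # interleave real and pseudo samples
    return batch


# Toy stand-ins for the diffusion models (illustration only).
def toy_state_generator(task_id: int, rng: random.Random) -> State:
    return [rng.gauss(float(task_id), 0.1)]


def toy_behavior_generator(state: State, rng: random.Random) -> Action:
    return [-state[0] + rng.gauss(0.0, 0.01)]


rng = random.Random(0)
new_task_data = [(2, [rng.gauss(2.0, 0.1)], [0.0]) for _ in range(32)]
batch = make_replay_batch(
    new_task_data,
    past_task_ids=[0, 1],
    state_generator=toy_state_generator,
    behavior_generator=toy_behavior_generator,
    batch_size=8,
    replay_fraction=0.5,
    rng=rng,
)
```

In a continual setting, a batch built this way would be used to update both generators, so that they keep modeling earlier tasks without access to the earlier datasets.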