Diffusion models have recently emerged as a promising backbone for the sequence-modeling paradigm in offline reinforcement learning (RL). However, most of these methods generalize poorly across tasks whose reward functions or dynamics differ. To address this, we propose MetaDiffuser, a task-oriented conditioned diffusion planner for offline meta-RL that casts generalization as a conditional trajectory generation task with contextual representations. The key idea is to learn a context-conditioned diffusion model that generates task-oriented trajectories for planning across diverse tasks. To improve the dynamics consistency of the generated trajectories while encouraging them to achieve high returns, we further design a dual-guided module for the sampling process of the diffusion model. The proposed framework is robust to the quality of the warm-start data collected from the test task and flexible enough to incorporate different task representation methods. Experimental results on MuJoCo benchmarks show that MetaDiffuser outperforms other strong offline meta-RL baselines, demonstrating the strong conditional generation ability of the diffusion architecture.
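To make the idea of dual-guided sampling concrete, the following is a minimal toy sketch and not the MetaDiffuser implementation. Everything in it is an assumption made for illustration: the one-dimensional state, the known toy dynamics `s[t+1] = s[t] + STEP`, the return defined as the sum of states, the hand-written `denoised_mean` standing in for a learned context-conditioned denoiser, and the constant guidance scale `GUIDE_SCALE`. It only shows how a return gradient and a dynamics-discrepancy gradient might be combined to shift the denoised mean at each reverse step.

```python
import numpy as np

STEP = 0.1         # hypothetical toy dynamics: s[t+1] = s[t] + STEP
GUIDE_SCALE = 0.1  # hypothetical constant guidance step size


def dynamics_error(traj):
    """Sum of squared violations of the toy dynamics along the trajectory."""
    resid = traj[1:] - traj[:-1] - STEP
    return float(np.sum(resid ** 2))


def dynamics_error_grad(traj):
    """Analytic gradient of dynamics_error with respect to the trajectory."""
    resid = traj[1:] - traj[:-1] - STEP
    grad = np.zeros_like(traj)
    grad[1:] += 2.0 * resid
    grad[:-1] -= 2.0 * resid
    return grad


def return_grad(traj):
    """Gradient of the toy return sum(traj), which is a vector of ones."""
    return np.ones_like(traj)


def denoised_mean(traj, context):
    """Stand-in for a learned context-conditioned denoiser (assumption):
    pulls every state halfway toward a scalar task context."""
    return traj + 0.5 * (context - traj)


def apply_guidance(mean, w_return=0.1, w_dyn=1.0):
    """Shift the denoised mean toward high return and low dynamics error."""
    guide = w_return * return_grad(mean) - w_dyn * dynamics_error_grad(mean)
    return mean + GUIDE_SCALE * guide


def dual_guided_sample(context, horizon=8, n_steps=50,
                       w_return=0.1, w_dyn=1.0, seed=0):
    """Run a toy reverse process with both guidance terms at every step."""
    rng = np.random.default_rng(seed)
    traj = rng.standard_normal(horizon)      # start from pure noise
    for sigma in np.linspace(0.5, 0.0, n_steps):
        mean = denoised_mean(traj, context)
        mean = apply_guidance(mean, w_return, w_dyn)
        traj = mean + sigma * rng.standard_normal(horizon)
    return traj


if __name__ == "__main__":
    guided = dual_guided_sample(context=1.0, w_dyn=1.0)
    unguided = dual_guided_sample(context=1.0, w_dyn=0.0)
    print("dynamics error with dynamics guidance:   ", dynamics_error(guided))
    print("dynamics error without dynamics guidance:", dynamics_error(unguided))
```

In this toy setting the stand-in denoiser pulls the trajectory toward a constant, which conflicts with the dynamics, so the dynamics term visibly reduces the discrepancy. In a real planner both gradients would come from learned models rather than closed-form expressions.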