Offline Reinforcement Learning (RL) methods leverage previously collected experience to learn policies that outperform the behavior policy used for data collection. In contrast to behavior cloning, which assumes the data comes from expert demonstrations, offline RL can work with non-expert data and multimodal behavior policies. However, because there is no online interaction during training, offline RL algorithms struggle to handle distribution shift and to represent policies effectively. Prior work on offline RL uses conditional diffusion models to represent multimodal behavior in the dataset. Nevertheless, these methods are not tailored toward generalizing to out-of-distribution (OOD) states. We introduce a novel method named State Reconstruction for Diffusion Policies (SRDP), which incorporates state-reconstruction feature learning into the recent class of diffusion policies to address the OOD generalization problem. The state reconstruction loss promotes generalizable representation learning of states, alleviating the distribution shift incurred by OOD states. We design a novel 2D Multimodal Contextual Bandit environment to illustrate the OOD generalization and faster convergence of SRDP compared to prior algorithms. In addition, we assess the performance of our model on D4RL continuous control benchmarks, namely the navigation of an 8-DoF ant and forward locomotion of half-cheetah, hopper, and walker2d, achieving state-of-the-art results.
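To make the idea of adding a state reconstruction term to a diffusion policy objective concrete, the following is a minimal NumPy sketch. It is not the authors' implementation: the single-layer networks, the linear noise schedule, the `recon_weight` coefficient, and all names (`encoder`, `decoder`, `noise_head`, `srdp_style_loss`) are assumptions made for illustration. It only shows a shared state encoder feeding both a noise-prediction head and a decoder that reconstructs the state, and omits any other terms the full method may use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for this illustration.
STATE_DIM, ACTION_DIM, FEAT_DIM, NUM_STEPS = 4, 2, 8, 10


def init_linear(in_dim, out_dim):
    """Return (weight, bias) for a single linear layer."""
    return rng.normal(0.0, 0.1, size=(in_dim, out_dim)), np.zeros(out_dim)


def linear(x, params):
    weight, bias = params
    return x @ weight + bias


# Shared state encoder, state decoder, and noise-prediction head (all assumed).
encoder = init_linear(STATE_DIM, FEAT_DIM)
decoder = init_linear(FEAT_DIM, STATE_DIM)
noise_head = init_linear(FEAT_DIM + ACTION_DIM + 1, ACTION_DIM)

# Assumed linear beta schedule for the forward diffusion process.
betas = np.linspace(1e-4, 0.2, NUM_STEPS)
alpha_bar = np.cumprod(1.0 - betas)


def srdp_style_loss(states, actions, recon_weight=1.0):
    """Denoising loss on actions plus a weighted state reconstruction loss."""
    batch = states.shape[0]

    # Shared state representation used by both loss terms.
    features = np.tanh(linear(states, encoder))

    # State reconstruction term: decode the features back to the state.
    reconstructed = linear(features, decoder)
    recon_loss = np.mean((reconstructed - states) ** 2)

    # Diffusion term: noise the actions at a random step, then predict the noise.
    t = rng.integers(0, NUM_STEPS, size=batch)
    eps = rng.normal(size=actions.shape)
    a_bar = alpha_bar[t][:, None]
    noisy_actions = np.sqrt(a_bar) * actions + np.sqrt(1.0 - a_bar) * eps
    t_embed = (t[:, None] + 1) / NUM_STEPS  # crude scalar time embedding
    head_input = np.concatenate([features, noisy_actions, t_embed], axis=1)
    eps_pred = linear(head_input, noise_head)
    diffusion_loss = np.mean((eps_pred - eps) ** 2)

    total = diffusion_loss + recon_weight * recon_loss
    return total, diffusion_loss, recon_loss
```

Calling `srdp_style_loss(states, actions)` on a batch of shape `(N, 4)` states and `(N, 2)` actions returns the combined objective together with its two components. In a real training loop both terms would be minimized jointly with an autodiff framework, so that the encoder is shaped by the reconstruction signal as well as by the policy loss.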