Offline reinforcement learning (RL) methods leverage previously collected experience to learn policies that outperform the behavior policy used for data collection. In contrast to behavior cloning, which assumes the data comes from expert demonstrations, offline RL can work with non-expert data and multimodal behavior policies. However, because there is no online interaction during training, offline RL algorithms struggle to handle distribution shift and to represent policies effectively. Prior work on offline RL uses conditional diffusion models to represent multimodal behavior in the dataset. Nevertheless, these methods are not designed to generalize to out-of-distribution (OOD) states. We introduce State Reconstruction for Diffusion Policies (SRDP), a novel method that incorporates state reconstruction feature learning into the recent class of diffusion policies to address the OOD generalization problem. The state reconstruction loss promotes more descriptive representation learning of states, alleviating the distribution shift incurred by OOD states. We design a novel 2D Multimodal Contextual Bandit environment to illustrate the OOD generalization of SRDP compared to prior algorithms. In addition, we assess the performance of our model on D4RL continuous control benchmarks, namely navigation of an 8-DoF ant and forward locomotion of half-cheetah, hopper, and walker2d, achieving state-of-the-art results.
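As a rough illustration of how a state reconstruction term can sit alongside a diffusion policy objective, the sketch below combines a noise-prediction loss on actions with a reconstruction loss on states through a shared state encoder. Everything in it is an assumption made for illustration and is not taken from the SRDP implementation: the linear layers, the DDPM-style noise-prediction objective, the names `init_params`, `joint_loss`, and `recon_weight`, and the random toy data.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(state_dim, action_dim, hidden_dim, scale=0.1):
    """Hypothetical linear layers: a shared state encoder and two heads."""
    return {
        # Shared encoder: state -> latent features.
        "W_enc": scale * rng.standard_normal((state_dim, hidden_dim)),
        # Reconstruction head: latent features -> reconstructed state.
        "W_dec": scale * rng.standard_normal((hidden_dim, state_dim)),
        # Noise-prediction head: [features, noisy action, timestep] -> noise.
        "W_eps": scale * rng.standard_normal((hidden_dim + action_dim + 1, action_dim)),
    }

def joint_loss(params, states, actions, alpha_bar, recon_weight=1.0):
    """Noise-prediction loss plus a weighted state reconstruction loss."""
    batch = states.shape[0]
    num_steps = len(alpha_bar)

    # Forward diffusion: corrupt the dataset actions at random timesteps.
    t = rng.integers(0, num_steps, size=batch)
    a_bar = alpha_bar[t][:, None]
    noise = rng.standard_normal(actions.shape)
    noisy_actions = np.sqrt(a_bar) * actions + np.sqrt(1.0 - a_bar) * noise

    # Shared state features feed both heads.
    features = np.tanh(states @ params["W_enc"])

    # Head 1: predict the injected noise, conditioned on the state features.
    t_feat = (t / num_steps)[:, None]
    head_in = np.concatenate([features, noisy_actions, t_feat], axis=1)
    diffusion_loss = np.mean((head_in @ params["W_eps"] - noise) ** 2)

    # Head 2: reconstruct the input state from the same features.
    recon_loss = np.mean((features @ params["W_dec"] - states) ** 2)

    total = diffusion_loss + recon_weight * recon_loss
    return total, diffusion_loss, recon_loss

# Toy data standing in for an offline dataset (2D states and actions).
betas = np.linspace(1e-4, 0.02, 50)
alpha_bar = np.cumprod(1.0 - betas)
params = init_params(state_dim=2, action_dim=2, hidden_dim=16)
states = rng.uniform(-1.0, 1.0, size=(32, 2))
actions = rng.uniform(-1.0, 1.0, size=(32, 2))

total, diffusion_loss, recon_loss = joint_loss(params, states, actions, alpha_bar)
print(f"total={total:.4f} diffusion={diffusion_loss:.4f} recon={recon_loss:.4f}")
```

Because both heads read the same `features`, minimizing the reconstruction term pushes the encoder to retain information about the state itself, which is the intuition behind using it to obtain more descriptive state representations. The sketch only evaluates the loss; a training loop, a nonlinear network, and any value-guidance term would be needed in practice.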