Offline reinforcement learning (RL) methods leverage previously collected experience to learn policies that improve on the behavior policy used to collect the data. In contrast to behavior cloning, which assumes the data comes from expert demonstrations, offline RL can work with non-expert data and multimodal behavior policies. However, because there is no online interaction during training, offline RL algorithms struggle to handle distribution shift and to represent policies effectively. Prior work on offline RL uses conditional diffusion models to obtain policies expressive enough to represent the multimodal behavior in the dataset. These methods, however, are not designed to improve generalization to out-of-distribution states. We introduce a novel method that incorporates state reconstruction feature learning into the recent class of diffusion policies to address this out-of-distribution generalization problem. The state reconstruction loss encourages more descriptive state representations, which alleviates the distribution shift caused by out-of-distribution states. We design a 2D Multimodal Contextual Bandit environment to demonstrate and evaluate the proposed model. We evaluate the model both in this new environment and on several D4RL benchmark tasks, achieving state-of-the-art results.
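The abstract does not give the exact training objective, so the following is only a minimal sketch of how a state reconstruction term might be combined with a diffusion policy loss. It assumes a DDPM-style noise-prediction loss on actions, a mean-squared-error reconstruction loss on states, and a simple weighted sum; the names `encode`, `predict_noise`, `decode`, and `recon_weight` are hypothetical and do not come from the paper.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def forward_noise(action, noise, alpha_bar):
    """DDPM-style forward process: blend the clean action with Gaussian noise."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * a + noise_scale * e for a, e in zip(action, noise)]


def combined_loss(state, action, noise, alpha_bar,
                  encode, predict_noise, decode, recon_weight=1.0):
    """Assumed objective: noise-prediction loss plus weighted state reconstruction.

    encode, predict_noise, and decode are hypothetical callables standing in
    for the state encoder, the conditional denoiser, and the state decoder.
    """
    z = encode(state)  # shared state representation
    noisy_action = forward_noise(action, noise, alpha_bar)
    # Diffusion term: how well the denoiser recovers the injected noise.
    diffusion_term = mse(predict_noise(noisy_action, alpha_bar, z), noise)
    # Reconstruction term: how well the representation preserves the state.
    recon_term = mse(decode(z), state)
    total = diffusion_term + recon_weight * recon_term
    return total, diffusion_term, recon_term


if __name__ == "__main__":
    # Toy stand-ins: identity encoder/decoder and a denoiser that predicts zeros.
    total, diff, recon = combined_loss(
        state=[0.5, -1.0],
        action=[0.2, 0.3],
        noise=[1.0, -1.0],
        alpha_bar=0.9,
        encode=lambda s: list(s),
        predict_noise=lambda a, ab, z: [0.0] * len(a),
        decode=lambda z: list(z),
    )
    print(total, diff, recon)
```

In this sketch the encoder output `z` feeds both the denoiser and the decoder, so the reconstruction term shapes the same representation that conditions the policy. Whether the actual method shares features in this way is an assumption.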