Applying diffusion models to long-term planning in reinforcement learning has recently attracted considerable attention. Several diffusion-based methods exploit the ability of diffusion models to represent arbitrary distributions: they generate future trajectories for planning and have shown significant improvements. However, these methods are limited by their plain base distributions and by overlooking the diversity of samples, in which different states yield different returns. They simply use diffusion to learn the distribution of the offline dataset and generate trajectories whose states follow that same distribution. As a result, the probability that these models reach high-return states depends largely on the dataset distribution, and even when equipped with a guidance model, their performance remains suppressed. To address these limitations, we propose CDiffuser, a novel method that introduces a return contrast mechanism: it pulls the states in generated trajectories towards high-return states and pushes them away from low-return states, thereby improving the base distribution. Experiments on 14 commonly used D4RL benchmarks demonstrate the effectiveness of the proposed method.
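To make the pull/push idea concrete, the following is a minimal sketch of one way a return contrast could be scored, using an InfoNCE-style loss over cosine similarities. The abstract does not specify the form of the loss, so the function names (`cosine_similarity`, `return_contrast_loss`), the temperature value, and the way states are grouped into high-return and low-return sets are all illustrative assumptions, not the formulation used by CDiffuser.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)  # small epsilon avoids division by zero


def return_contrast_loss(state, high_return_states, low_return_states,
                         temperature=0.5):
    """Hypothetical InfoNCE-style contrast for a single generated state.

    High-return states act as positives and low-return states as negatives.
    The loss shrinks as `state` moves closer to the positives and further
    from the negatives. This is an assumed form, not the paper's loss.
    """
    pos = sum(math.exp(cosine_similarity(state, p) / temperature)
              for p in high_return_states)
    neg = sum(math.exp(cosine_similarity(state, n) / temperature)
              for n in low_return_states)
    return -math.log(pos / (pos + neg))


# Toy 2-D states; in practice these would come from the offline dataset,
# grouped by their returns (the grouping rule is also an assumption here).
high = [[1.0, 0.0], [0.9, 0.1]]
low = [[0.0, 1.0], [0.1, 0.9]]

near_high = return_contrast_loss([1.0, 0.05], high, low)
near_low = return_contrast_loss([0.05, 1.0], high, low)
```

In this toy setup, a generated state lying near the high-return group receives a smaller loss than one lying near the low-return group, so minimizing such a term during training would bias generated trajectories towards high-return regions.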