Offline reinforcement learning (RL), which aims to learn an optimal policy from a previously collected static dataset, is an important paradigm of RL. Standard RL methods often perform poorly in this regime because of function approximation errors on out-of-distribution actions. While a variety of regularization methods have been proposed to mitigate this issue, they are often constrained by policy classes with limited expressiveness, which can lead to highly suboptimal solutions. In this paper, we propose representing the policy as a diffusion model, a recent class of highly expressive deep generative models. We introduce Diffusion Q-learning (Diffusion-QL), which uses a conditional diffusion model to represent the policy. In our approach, we learn an action-value function and add a term that maximizes action values to the training loss of the conditional diffusion model, which results in a loss that seeks optimal actions near the behavior policy. We show that both the expressiveness of the diffusion model-based policy and the coupling of behavior cloning and policy improvement under the diffusion model contribute to the outstanding performance of Diffusion-QL. We illustrate the superiority of our method over prior work in a simple 2D bandit example with a multimodal behavior policy. We then show that our method achieves state-of-the-art performance on the majority of the D4RL benchmark tasks.
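To make the structure of the training loss described above concrete, the following is a minimal NumPy sketch of a loss that combines a denoising behavior-cloning term with a term that rewards high action values. It is an illustration under stated assumptions, not the authors' implementation: `noise_predictor` (a linear stand-in for the conditional noise network), `q_value` (a toy critic), the weight `eta`, and the noise schedule are all hypothetical. In particular, the action fed to the critic is a one-step denoised estimate, a simplification chosen for brevity; the abstract does not specify how policy actions are generated for this term.

```python
import numpy as np


def q_value(state, action):
    # Hypothetical toy critic (assumption): prefers actions close to 0.5.
    return -np.sum((action - 0.5) ** 2, axis=-1)


def noise_predictor(params, noisy_action, state, t_frac):
    # Hypothetical linear stand-in for a conditional noise-prediction network.
    feats = np.concatenate([noisy_action, state, t_frac[:, None]], axis=-1)
    return feats @ params


def combined_loss(params, states, actions, alpha_bar, eta, rng):
    """Return (total, bc_loss, q_term) with total = bc_loss - eta * q_term."""
    n = actions.shape[0]
    num_steps = len(alpha_bar)

    # Forward diffusion: corrupt dataset actions at a random timestep.
    t = rng.integers(0, num_steps, size=n)
    eps = rng.standard_normal(actions.shape)
    ab = alpha_bar[t][:, None]
    noisy = np.sqrt(ab) * actions + np.sqrt(1.0 - ab) * eps

    # Behavior-cloning term: predict the injected noise, given the state.
    eps_hat = noise_predictor(params, noisy, states, t / num_steps)
    bc_loss = np.mean((eps - eps_hat) ** 2)

    # Policy-improvement term: score a denoised action with the critic.
    # One-step estimate of the clean action (a simplification, see lead-in).
    a0_hat = (noisy - np.sqrt(1.0 - ab) * eps_hat) / np.sqrt(ab)
    q_term = np.mean(q_value(states, a0_hat))

    # Minimizing the total stays close to the data while raising Q.
    return bc_loss - eta * q_term, bc_loss, q_term


# Usage on random data (all shapes and values are illustrative).
demo_rng = np.random.default_rng(0)
demo_states = demo_rng.standard_normal((8, 3))
demo_actions = demo_rng.standard_normal((8, 2))
demo_alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 10))
demo_params = np.zeros((2 + 3 + 1, 2))
print(combined_loss(demo_params, demo_states, demo_actions,
                    demo_alpha_bar, 0.5, demo_rng))
```

The weight `eta` trades off the two terms: with `eta = 0` the loss reduces to pure behavior cloning, and larger values push the policy toward actions the critic scores highly. In a real system both networks would be trained by gradient descent in an automatic-differentiation framework; this sketch only evaluates the loss.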