Popular reinforcement learning (RL) algorithms tend to produce a unimodal policy distribution, which limits the expressiveness of complex policies and weakens exploration. Diffusion probability models are powerful tools for learning complicated multimodal distributions and have shown promising applications to RL. In this paper, we formally build a theoretical foundation for policy representation via the diffusion probability model and provide practical implementations of diffusion policy for online model-free RL. Concretely, we characterize the diffusion policy as a stochastic process, which is a new approach to representing a policy. We then present a convergence guarantee for the diffusion policy, which provides a theory for understanding its multimodality. Furthermore, we propose DIPO, an implementation of model-free online RL with DIffusion POlicy. To the best of our knowledge, DIPO is the first algorithm to solve model-free online RL problems with the diffusion model. Finally, extensive empirical results show the effectiveness and superiority of DIPO on the standard continuous-control MuJoCo benchmark.
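To make the idea of a policy represented as a stochastic process more concrete, the following is a minimal toy sketch, not the DIPO algorithm itself. It assumes a one-dimensional action space whose target action distribution is a two-mode Gaussian mixture, and it uses the exact score of the noised mixture in place of a learned network. The names `MODES`, `MODE_STD`, `score`, and `sample_actions`, as well as the noise schedule, are hypothetical choices made for illustration only. The reverse update is the standard DDPM ancestral step written in score form.

```python
import numpy as np

# Hypothetical toy setup: two equally good actions at -1 and +1.
MODES = np.array([-1.0, 1.0])
MODE_STD = 0.2

# Assumed linear noise schedule (not taken from the paper).
T = 200
betas = np.linspace(1e-4, 0.05, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)


def score(x, t):
    """Exact score of the noised two-mode mixture at diffusion step t.

    Stands in for the score network that a diffusion policy would learn.
    """
    ab = alpha_bars[t]
    means = np.sqrt(ab) * MODES                 # noised component means
    var = ab * MODE_STD ** 2 + (1.0 - ab)       # noised component variance
    diff = x[:, None] - means[None, :]          # shape (n, 2)
    logw = -0.5 * diff ** 2 / var               # unnormalized log responsibilities
    logw -= logw.max(axis=1, keepdims=True)     # numerical stability
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)
    return (w * (-diff / var)).sum(axis=1)


def sample_actions(n, seed=0):
    """Draw n actions by running the reverse process from Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    for t in range(T - 1, -1, -1):
        # No noise is added on the final step.
        z = rng.standard_normal(n) if t > 0 else np.zeros(n)
        x = (x + betas[t] * score(x, t)) / np.sqrt(alphas[t]) + np.sqrt(betas[t]) * z
    return x


if __name__ == "__main__":
    actions = sample_actions(5000)
    print("fraction of positive actions:", (actions > 0).mean())
    print("fraction of actions near 0:", (np.abs(actions) < 0.3).mean())
    print("sample mean:", actions.mean())
```

In this sketch the samples concentrate around both modes, while their mean lies near 0, a region that almost no sample visits. A single Gaussian policy fitted to the same data would center on that mean, which illustrates the unimodality limitation the abstract describes.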