Diffusion models excel at sampling from complex, unnormalized distributions. In this work, we extend Maximum Entropy Reinforcement Learning (ME-RL) to diffusion processes, enabling sampling from the optimal policy trajectory distribution. By minimizing a tractable upper bound on the reverse KL divergence between the diffusion policy and the optimal policy trajectory distributions, we derive a modified surrogate objective and introduce Diffusion-Augmented Markov Decision Processes (DA-MDPs). DA-MDPs allow diffusion policies to be integrated into any ME-RL method with minimal modifications. We demonstrate the framework's effectiveness by adapting Proximal Policy Optimization (PPO), Wasserstein Policy Optimization (WPO), and Relative Entropy Pathwise Policy Optimization (REPPO) into diffusion-based variants: DA-MDP: PPO, DA-MDP: WPO, and DA-MDP: REPPO. Empirical results on standard continuous-control benchmarks show that our approach matches or outperforms baseline methods, while experiments on multimodal benchmarks confirm its ability to model multimodal action distributions.
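The abstract does not spell out the form of the upper bound. One standard reason a KL divergence over whole trajectories can bound the KL divergence over final actions is the chain rule for KL (equivalently, the data-processing inequality): marginalizing out intermediate variables never increases the divergence. The toy sketch below checks this numerically for a two-step discrete chain. It is an illustration under that assumption, not the paper's objective; the distributions `prior`, `policy_kernel`, and `target_kernel` and all helper names are hypothetical.

```python
import math


def kl(p, q):
    """KL(p || q) for two dicts mapping outcomes to probabilities."""
    return sum(pv * math.log(pv / q[k]) for k, pv in p.items() if pv > 0)


def joint(prior, kernel):
    """Joint distribution over (latent, action) from a prior and a kernel."""
    return {(z, a): prior[z] * kernel[z][a] for z in prior for a in kernel[z]}


def marginal(joint_dist):
    """Sum out the latent variable, keeping only the final action."""
    out = {}
    for (_, a), p in joint_dist.items():
        out[a] = out.get(a, 0.0) + p
    return out


# Hypothetical two-step "denoising" chains: latent z -> action a.
prior = {0: 0.5, 1: 0.5}
policy_kernel = {0: {"left": 0.9, "right": 0.1}, 1: {"left": 0.2, "right": 0.8}}
target_kernel = {0: {"left": 0.6, "right": 0.4}, 1: {"left": 0.3, "right": 0.7}}

policy_joint = joint(prior, policy_kernel)
target_joint = joint(prior, target_kernel)

# Reverse KL (policy relative to target) on the full chain and on actions only.
kl_joint = kl(policy_joint, target_joint)
kl_marginal = kl(marginal(policy_joint), marginal(target_joint))

print(f"joint KL    = {kl_joint:.4f}")
print(f"marginal KL = {kl_marginal:.4f}")
```

The joint divergence only requires per-step transition probabilities, which is what makes a bound of this kind tractable when the marginal over final actions has no closed form.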