Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization

Diffusion models have attracted wide attention in Reinforcement Learning (RL) for their expressiveness and multimodality. Diffusion policies have been shown to significantly improve the performance of RL algorithms on continuous control tasks: they overcome the limitations of unimodal policies, such as Gaussian policies, and give the agent stronger exploration capabilities. However, existing work mainly applies diffusion policies to offline RL, while their incorporation into online RL remains less investigated. The training objective of the diffusion model, known as the variational lower bound, cannot be optimized directly in online RL because 'good' actions are unavailable, which makes diffusion policy improvement difficult. To overcome this, we propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO). Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions. To satisfy these conditions in general scenarios, we introduce Q-weight transformation functions. Additionally, to further enhance the exploration capability of the diffusion policy, we design a special entropy regularization term. We also develop an efficient behavior policy that improves sample efficiency by reducing the variance of the diffusion policy during online interactions. Consequently, QVPO leverages the exploration capabilities and multimodality of diffusion policies, preventing the RL agent from converging to a sub-optimal policy. To verify the effectiveness of QVPO, we conduct comprehensive experiments on MuJoCo benchmarks. The results demonstrate that QVPO achieves state-of-the-art performance in both cumulative reward and sample efficiency.

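To make the idea of a Q-weighted variational loss concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the loss takes the form of a per-sample noise-prediction (denoising) error scaled by a non-negative weight derived from Q-values. The abstract does not specify the Q-weight transformation functions, so the clipped-advantage transform used here, along with the names `q_weights`, `q_weighted_denoising_loss`, and `baseline`, is a hypothetical choice made for illustration only.

```python
import numpy as np


def q_weights(q_values, baseline):
    """Hypothetical Q-weight transform (an assumption, not from the abstract).

    Subtracts a baseline from the Q-values and clips at zero, so that
    every weight is non-negative and only actions with above-baseline
    Q-values contribute to the loss.
    """
    return np.maximum(np.asarray(q_values, dtype=float) - baseline, 0.0)


def q_weighted_denoising_loss(eps_true, eps_pred, weights):
    """Weighted mean of per-sample squared noise-prediction errors.

    eps_true: noise added to each action, shape (batch, action_dim)
    eps_pred: noise predicted by the policy network, same shape
    weights:  non-negative per-sample weights, shape (batch,)
    """
    per_sample_error = np.sum((eps_true - eps_pred) ** 2, axis=-1)
    return float(np.mean(weights * per_sample_error))


# Toy batch of three (state, action) samples with assumed Q estimates.
weights = q_weights([1.0, 3.0, 2.0], baseline=2.0)
loss = q_weighted_denoising_loss(np.zeros((3, 2)), np.ones((3, 2)), weights)
```

In this sketch, samples whose Q-value does not exceed the baseline receive zero weight, so the denoising objective is driven only by the actions the critic currently rates as better than the baseline. This is one way to stand in for the 'good' actions that are unavailable in online RL.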