Owing to its training stability and strong expressiveness, the diffusion model has attracted considerable attention in offline reinforcement learning (RL). However, it also brings several challenges: 1) the large number of diffusion steps required makes diffusion-model-based methods time-inefficient and limits their application in real-time control; 2) how to achieve policy improvement with accurate guidance for a diffusion-model-based policy remains an open problem. Inspired by the consistency model, we propose a novel time-efficient method named Consistency Policy with Q-Learning (CPQL), which generates an action from noise in a single step. By establishing a mapping from the reverse diffusion trajectories to the desired policy, we simultaneously address the issues of time efficiency and inaccurate guidance when updating a diffusion-model-based policy with the learned Q-function. We demonstrate that CPQL can achieve policy improvement with accurate guidance for offline RL and can be seamlessly extended to online RL tasks. Experimental results indicate that CPQL achieves new state-of-the-art performance on 11 offline and 21 online tasks while improving inference speed by nearly 45 times compared to Diffusion-QL. We will release our code later.