Recent advances in diffusion-based reinforcement learning (RL) methods have demonstrated promising results on a wide range of continuous control tasks. However, existing works in this field focus on the application of diffusion policies while leaving diffusion critics unexplored. In fact, since policy optimization fundamentally relies on the critic, accurate value estimation is far more important than policy expressiveness. Furthermore, given the stochasticity of most reinforcement learning tasks, prior work has shown that the critic is more appropriately modeled distributionally. Motivated by these observations, we propose a novel distributional RL method with Diffusion Bridge Critics (DBC). DBC directly models the inverse cumulative distribution function (CDF) of the Q value, allowing it to accurately capture the value distribution; the strong distribution-matching capability of the diffusion bridge prevents the learned distribution from collapsing into a trivial Gaussian. We further derive an analytic integral formula that addresses discretization errors in DBC, which is essential for accurate value estimation. To our knowledge, DBC is the first work to employ the diffusion bridge model as the critic. Notably, DBC is also a plug-and-play component and can be integrated into most existing RL frameworks. Experimental results on MuJoCo robot control benchmarks demonstrate the superiority of DBC over previous distributional critic models.