Soft policies in reinforcement learning define policies as Boltzmann distributions over state-action value functions, providing a principled mechanism for balancing exploration and exploitation. However, realizing such soft policies in practice remains challenging. Existing approaches either depend on parametric policies with limited expressivity or employ diffusion-based policies whose intractable likelihoods hinder reliable entropy estimation in soft policy objectives. We address this challenge by directly realizing soft-policy sampling via Langevin dynamics driven by the action gradient of the Q-function. This perspective leads to Langevin Q-Learning (LQL), which samples actions from the target Boltzmann distribution without explicitly parameterizing the policy. However, directly applying Langevin dynamics suffers from slow mixing in high-dimensional and non-convex Q-landscapes, limiting its practical effectiveness. To overcome this, we propose Noise-Conditioned Langevin Q-Learning (NC-LQL), which integrates multi-scale noise perturbations into the value function. NC-LQL learns a noise-conditioned Q-function that induces a sequence of progressively smoothed value landscapes, enabling sampling to transition from global exploration to precise mode refinement. On OpenAI Gym MuJoCo benchmarks, NC-LQL achieves competitive performance compared to state-of-the-art diffusion-based methods, providing a simple yet powerful solution for online RL.
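The core sampling idea in the abstract — drawing actions from a Boltzmann distribution π(a|s) ∝ exp(Q(s, a)/α) via Langevin dynamics driven by the action gradient of Q — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the quadratic `Q`, the temperature `alpha`, the step size, and the iteration count are all illustrative assumptions, and the plain unadjusted Langevin update shown here is exactly the variant the abstract notes can mix slowly on non-convex landscapes.

```python
import numpy as np

def langevin_sample(grad_q, a0, alpha=1.0, step=1e-2, n_steps=500, rng=None):
    """Unadjusted Langevin dynamics targeting pi(a) ∝ exp(Q(s, a) / alpha).

    grad_q : callable returning the action gradient ∇_a Q(s, a) at a fixed state.
    a0     : initial action (NumPy array).
    """
    rng = np.random.default_rng() if rng is None else rng
    a = np.array(a0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(a.shape)
        # Langevin update: drift along ∇_a Q / alpha plus Gaussian noise.
        a = a + 0.5 * step * grad_q(a) / alpha + np.sqrt(step) * noise
    return a

# Toy sanity check (hypothetical Q): Q(s, a) = -||a - mu||^2 makes the
# Boltzmann target a Gaussian centered at mu with covariance (alpha / 2) I.
mu = np.array([1.0, -2.0])
grad_q = lambda a: -2.0 * (a - mu)

rng = np.random.default_rng(0)
samples = np.stack([
    langevin_sample(grad_q, rng.standard_normal(2), alpha=0.5, rng=rng)
    for _ in range(200)
])
print(samples.mean(axis=0))  # close to mu
```

On this convex toy landscape the chain concentrates around the Q-maximizing action, with spread controlled by the temperature `alpha`; the paper's NC-LQL extension addresses the harder case where the Q-landscape is high-dimensional and multi-modal.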