Reinforcement learning (RL) has proven highly effective for complex control and decision-making tasks. However, most traditional RL algorithms have two limitations. First, the policy is typically parameterized as a diagonal Gaussian distribution, which prevents it from capturing multimodal distributions and makes it difficult to cover the full range of optimal solutions in multi-solution problems. Second, the return is reduced to its mean, which discards its multimodal nature and provides insufficient guidance for policy updates. To address these problems, we propose an RL algorithm termed flow-based policy with distributional RL (FP-DRL). The algorithm models the policy using flow matching, which offers both computational efficiency and the capacity to fit complex distributions. It also employs a distributional RL approach to model and optimize the entire return distribution, thereby guiding multimodal policy updates more effectively and improving agent performance. Experiments on MuJoCo benchmarks demonstrate that FP-DRL achieves state-of-the-art (SOTA) performance on most MuJoCo control tasks, and that the flow policy exhibits superior representation capability.
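To make the contrast with a diagonal Gaussian policy concrete, the following is a minimal sketch of how a flow-based policy can draw actions: Gaussian noise is pushed along a state-conditioned velocity field by a few Euler steps. It is not the FP-DRL implementation. The names `toy_velocity` and `sample_action`, the step count, and the hand-written velocity field with two modes at `state - 1` and `state + 1` are all assumptions made for illustration; in a flow-matching policy the velocity field would be a trained network.

```python
import numpy as np


def toy_velocity(action, t, state):
    """Hypothetical stand-in for a learned velocity field v(a, t | s).

    Assumption: each state has two equally good actions, state - 1 and
    state + 1. The field moves a sample in a straight line toward the
    nearer mode, so that it arrives there at t = 1.
    """
    target = state + np.where(action >= state, 1.0, -1.0)
    return (target - action) / (1.0 - t)


def sample_action(state, rng, num_steps=10):
    """Draw actions by Euler-integrating the velocity field from t=0 to t=1."""
    action = rng.standard_normal(state.shape)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so (1 - t) is never zero
        action = action + dt * toy_velocity(action, t, state)
    return action


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    states = np.zeros(1000)
    actions = sample_action(states, rng)
    # Fraction of samples that ended at each of the two modes.
    print("share at +1:", np.mean(actions > 0))
    print("share at -1:", np.mean(actions < 0))
```

In this toy setting the sampled actions split between the two modes, whereas a single Gaussian fitted to the same two solutions would center its mass between them. The straight-line field is chosen only so that the result is easy to check by hand.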