Among on-policy reinforcement learning algorithms, Proximal Policy Optimization (PPO) is widely favored for its simplicity, numerical stability, and strong empirical performance. Standard PPO relies on surrogate objectives defined via importance ratios, which require evaluating policy likelihoods; this evaluation is typically straightforward when the policy is modeled as a Gaussian distribution. However, extending PPO to more expressive, high-capacity policy models such as continuous normalizing flows (CNFs), also known as flow-matching models, is challenging because likelihood evaluation along the full flow trajectory is computationally expensive and often numerically unstable. To resolve this issue, we propose PolicyFlow, a novel on-policy CNF-based reinforcement learning algorithm that integrates expressive CNF policies with PPO-style objectives without requiring likelihood evaluation along the full flow path. PolicyFlow approximates importance ratios using velocity-field variations along a simple interpolation path, reducing computational overhead without compromising training stability. To prevent mode collapse and encourage diverse behaviors, we further propose the Brownian Regularizer, an implicit policy-entropy regularizer inspired by Brownian motion that is conceptually elegant and computationally lightweight. Experiments on diverse tasks across environments including MultiGoal, PointMaze, IsaacLab, and MuJoCo Playground show that PolicyFlow achieves competitive or superior performance compared to PPO with Gaussian policies and to flow-based baselines including FPO and DPPO. Notably, results on MultiGoal highlight PolicyFlow's ability to capture richer multimodal action distributions.
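The ratio approximation described above can be illustrated with a minimal sketch. This is not PolicyFlow's actual estimator; it is one plausible reading, loosely following the FPO-style construction in which the PPO importance ratio is replaced by the exponentiated difference of per-sample flow-matching losses along the linear interpolation path, so no likelihood is integrated along the flow. The toy linear velocity field, the single-sample estimate, and all names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fm_loss(theta, x_t, t, v_target):
    """Per-sample flow-matching loss: squared error between the velocity
    predicted at (x_t, t) and the interpolation target. The 'network' is
    a toy linear field v(x, t) = theta * x standing in for a real CNF
    velocity network (illustrative assumption)."""
    v_pred = theta * x_t
    return np.sum((v_pred - v_target) ** 2)

def surrogate_ratio(theta_new, theta_old, action, eps_clip=0.2):
    """Approximate the PPO importance ratio from velocity-field variation
    along the straight-line interpolation path, instead of evaluating
    exact CNF likelihoods along the full flow trajectory."""
    x0 = rng.standard_normal(action.shape)   # noise endpoint of the path
    t = rng.uniform()                        # random interpolation time
    x_t = (1 - t) * x0 + t * action          # linear interpolation point
    v_target = action - x0                   # target (constant) velocity
    # Loss-difference surrogate: exp(old loss - new loss) is > 1 when the
    # new policy fits this (noise, time, action) tuple better.
    ratio = np.exp(fm_loss(theta_old, x_t, t, v_target)
                   - fm_loss(theta_new, x_t, t, v_target))
    # PPO-style clipping of the surrogate ratio.
    return np.clip(ratio, 1 - eps_clip, 1 + eps_clip)

action = np.array([0.5, -0.3])
r = surrogate_ratio(theta_new=0.9, theta_old=1.0, action=action)
print(r)  # a positive scalar, clipped into [0.8, 1.2]
```

In a real implementation the loss difference would be estimated over a batch of interpolation times and noise samples per action, and the clipped ratio would multiply the advantage in the usual PPO surrogate objective.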