Diffusion-based policies have gained significant popularity in Reinforcement Learning (RL) due to their ability to represent complex, non-Gaussian distributions. However, diffusion policies based on Stochastic Differential Equations (SDEs) often rely on indirect entropy control because their exact entropy is intractable, and they suffer from computationally prohibitive policy gradients through the iterative denoising chain. To overcome these issues, we propose Flow Matching Policy with Entropy Regularization (FMER), an Ordinary Differential Equation (ODE)-based online RL framework. FMER parameterizes the policy via flow matching and samples actions along a straight probability path, motivated by optimal transport. FMER leverages the model's generative nature to construct an advantage-weighted target velocity field from a candidate set of actions, steering policy updates toward high-value regions. By deriving a tractable entropy objective, FMER enables principled maximum-entropy optimization for enhanced exploration. Experiments on sparse multi-goal FrankaKitchen benchmarks demonstrate that FMER outperforms state-of-the-art methods, while remaining competitive on standard MuJoCo benchmarks. Moreover, FMER reduces training time by a factor of 7 compared to heavy diffusion baselines (QVPO) and by 10-15% relative to efficient variants.
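As a rough illustration of the ingredients named in the abstract (a straight probability path, an advantage-weighted velocity target built from a candidate set, and ODE-based sampling), the following is a minimal NumPy sketch. It is not FMER's actual implementation: the softmax weighting with a `temperature` parameter, the Euler integrator, and all function names are assumptions made for illustration, and the entropy term is omitted because the abstract does not state its form.

```python
import numpy as np

def straight_path_point(x0, x1, t):
    """Point at time t on the straight line from noise x0 to target action x1."""
    return (1.0 - t) * x0 + t * x1

def advantage_weights(advantages, temperature=1.0):
    """Softmax weights over a candidate set (an assumed weighting scheme)."""
    z = (advantages - advantages.max()) / temperature  # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

def weighted_flow_matching_loss(velocity_fn, candidates, advantages, rng,
                                temperature=1.0):
    """Advantage-weighted regression of velocity_fn onto straight-path velocities.

    candidates: (n, d) array of candidate actions (hypothetical input).
    advantages: (n,) array of advantage estimates for those candidates.
    """
    n, d = candidates.shape
    x0 = rng.standard_normal((n, d))        # noise samples
    t = rng.uniform(size=(n, 1))            # times in [0, 1)
    xt = straight_path_point(x0, candidates, t)
    target_v = candidates - x0              # constant velocity of a straight path
    pred_v = velocity_fn(xt, t)
    per_sample = ((pred_v - target_v) ** 2).sum(axis=1)
    w = advantage_weights(advantages, temperature)
    return float((w * per_sample).sum())

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with forward Euler."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = np.full((x.shape[0], 1), k * dt)
        x = x + dt * velocity_fn(x, t)
    return x
```

In this sketch, a velocity field of the form `(a - x) / (1 - t)` transports any starting noise to the single action `a`, which makes it a convenient sanity check for both the loss and the sampler. In practice `velocity_fn` would be a learned, state-conditioned network.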