Iterative generative policies, such as diffusion models and flow matching, offer superior expressivity for continuous control but complicate Maximum Entropy Reinforcement Learning because their action log-densities are not directly accessible. To address this, we propose Field Least-Energy Actor-Critic (FLAC), a likelihood-free framework that regulates policy stochasticity by penalizing the kinetic energy of the velocity field. Our key insight is to formulate policy optimization as a Generalized Schrödinger Bridge (GSB) problem relative to a high-entropy reference process (e.g., uniform). Under this view, the maximum-entropy principle reduces to staying close to a high-entropy reference process while optimizing return, without requiring explicit action densities. In this framework, kinetic energy serves as a physically grounded proxy for divergence from the reference: minimizing path-space energy bounds the deviation of the induced terminal action distribution. Building on this formulation, we derive an energy-regularized policy iteration scheme and a practical off-policy algorithm that automatically tunes the kinetic-energy penalty via a Lagrangian dual mechanism. Empirically, FLAC achieves superior or comparable performance on high-dimensional benchmarks relative to strong baselines, while avoiding explicit density estimation.
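The two ingredients named above can be sketched in a few lines: accumulating the path-space kinetic energy of a velocity field along an Euler-discretized flow, and a SAC-style dual ascent step on the Lagrange multiplier that weights that energy. This is a minimal illustrative sketch, not the paper's implementation; the function names (`kinetic_energy`, `dual_update`), the Euler discretization, the energy budget `target_energy`, and the step size are all assumptions for exposition.

```python
import numpy as np

def kinetic_energy(velocity_field, x0, n_steps=10):
    """Integrate the flow x' = v(x, t) with Euler steps over t in [0, 1]
    and accumulate the path-space kinetic energy (1/2)||v||^2 dt, the
    likelihood-free quantity penalized in place of an action log-density.
    Returns the terminal action and the accumulated energy."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    energy = 0.0
    for k in range(n_steps):
        t = k * dt
        v = velocity_field(x, t)
        energy += 0.5 * float(np.sum(v ** 2)) * dt  # (1/2)||v||^2 dt
        x = x + v * dt                              # Euler step
    return x, energy

def dual_update(lmbda, energy, target_energy, lr=0.1):
    """One step of Lagrangian dual ascent on the penalty weight:
    raise it when measured energy exceeds the budget, lower it
    otherwise, clipped to stay non-negative (analogous to automatic
    temperature tuning in SAC)."""
    return max(0.0, lmbda + lr * (energy - target_energy))

# A toy contracting velocity field v(x, t) = -x: the flow shrinks the
# sample toward the origin while paying kinetic energy along the path.
xT, e = kinetic_energy(lambda x, t: -x, np.array([1.0, 0.0]))
lmbda = dual_update(1.0, e, target_energy=0.1)  # energy over budget -> weight rises
```

In a full algorithm the actor loss would trade off the critic's Q-value against `lmbda * energy`, with `dual_update` run alongside the actor and critic updates; the toy field here stands in for a learned network.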