Recent work shows that flow matching can be effective for scalar Q-value function estimation in reinforcement learning (RL), but it remains unclear why, or how this approach differs from standard critics. Contrary to conventional belief, we show that their success is not explained by distributional RL, as explicitly modeling return distributions can reduce performance. Instead, we argue that reading out values by integration, together with dense velocity supervision at each step of that integration process during training, improves TD learning via two mechanisms. First, it enables robust value prediction through \emph{test-time recovery}: iterative computation through integration dampens errors in early value estimates as more integration steps are performed, a recovery mechanism absent in monolithic critics. Second, supervising the velocity field at multiple interpolant values induces more \emph{plastic} feature learning within the network, allowing critics to represent non-stationary TD targets without discarding previously learned features or overfitting to individual TD targets encountered during training. We formalize these effects and validate them empirically, showing that flow-matching critics substantially outperform monolithic critics (2$\times$ in final performance and around 5$\times$ in sample efficiency) in settings where loss of plasticity poses a challenge, e.g., high update-to-data (UTD) online RL problems, while remaining stable during learning.
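The two ingredients named above, an integration-based value readout and a velocity target defined along an interpolant, can be illustrated with a minimal numerical sketch. Everything here is a stand-in, not the paper's learned critic: the velocity fields are closed-form, the target `y` is an arbitrary scalar, and `euler_readout` is a hypothetical helper.

```python
def euler_readout(velocity, x0=0.0, n_steps=8):
    """Read out a scalar by Euler-integrating dx/dt = velocity(x, t) over [0, 1]."""
    x, dt = x0, 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity(x, k * dt)  # one Euler integration step
    return x

y = 3.5  # stand-in scalar TD target (illustrative assumption)

# Flow-matching supervision: along the linear interpolant
# x_t = (1 - t) * x0 + t * y, the conditional velocity is constant, y - x0.
# Integrating this exact field from x0 = 0 recovers the target value.
q_hat = euler_readout(lambda x, t: y - 0.0)

# Test-time recovery: a field of the form v(x, t) = (y - x) / (1 - t) pulls
# the current state toward y, so an error injected early (here a badly wrong
# initial condition) is damped by the remaining integration steps.
recovered = euler_readout(lambda x, t: (y - x) / (1.0 - t), x0=-10.0)
```

The second readout shows the mechanism in its simplest form: because each step contracts the state toward the target, the final estimate is insensitive to the erroneous start, which a single monolithic forward pass has no way to correct.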