We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm that tackles a long-standing challenge in continuous-action RL: efficiently optimizing an expressive diffusion or flow-matching policy against a parameterized Q-function. Effective optimization requires exploiting first-order information from the critic, but doing so is challenging for flow or diffusion policies because direct gradient-based optimization via backpropagation through their multi-step denoising process is numerically unstable. Existing methods work around this either by using only the critic's value and discarding its gradient, or by relying on approximations that sacrifice policy expressivity or bias the learned policy. QAM sidesteps both compromises by leveraging adjoint matching, a recently proposed technique in generative modeling, which transforms the critic's action gradient into a step-wise objective that requires no backpropagation through the denoising chain, while still yielding an unbiased, expressive policy at the optimum. Combined with temporal-difference backup for critic learning, QAM consistently outperforms prior approaches on hard, sparse-reward tasks in both offline and offline-to-online RL.
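To make the mechanism concrete, the sketch below illustrates in JAX the two ingredients the abstract describes: solving a lean adjoint ODE backward along a sampled denoising trajectory (using only per-step vector-Jacobian products, so no backpropagation through the full chain), and a step-wise regression that pulls the policy's velocity field toward the base velocity shifted along the critic's transformed action gradient. This is a minimal sketch under assumptions: the function names (`lean_adjoint_solve`, `stepwise_matching_loss`), the callables they expect (`base_velocity`, `policy_velocity`), the Euler discretization, and the sign and scaling conventions are illustrative, not the paper's exact objective.

```python
import jax
import jax.numpy as jnp

def lean_adjoint_solve(base_velocity, q_grad_final, actions, times):
    """Solve a lean adjoint ODE backward along a sampled denoising path.

    actions[k] is the intermediate action at time times[k], sampled from the
    frozen base flow; q_grad_final is dQ/da at the final action (terminal
    adjoint). Each backward step needs only one local vector-Jacobian
    product, so there is no backpropagation through the multi-step chain.
    """
    adj = q_grad_final                       # terminal condition: dQ/da
    adjoints = [adj]
    for k in reversed(range(len(times) - 1)):
        dt = times[k + 1] - times[k]
        # local VJP against the frozen base velocity field b(a, t)
        _, vjp_fn = jax.vjp(lambda a: base_velocity(a, times[k + 1]),
                            actions[k + 1])
        adj = adj + dt * vjp_fn(adj)[0]      # Euler step of the adjoint ODE
        adjoints.append(adj)
    return adjoints[::-1]                    # adjoints[k] aligned with times[k]

def stepwise_matching_loss(policy_velocity, base_velocity, actions, times,
                           adjoints, scale=1.0):
    """Step-wise regression: pull the policy's velocity toward the base
    velocity shifted along the (stop-gradient) adjoint direction. The
    sign/scale convention here is an assumption for illustration."""
    loss = 0.0
    for k in range(len(times) - 1):
        target = (base_velocity(actions[k], times[k])
                  + scale * jax.lax.stop_gradient(adjoints[k]))
        loss += jnp.mean((policy_velocity(actions[k], times[k]) - target) ** 2)
    return loss / (len(times) - 1)

# Toy usage with a linear base field and a quadratic critic (illustrative):
base_v = lambda a, t: -a                          # stand-in base velocity
q_grad = lambda a: -2.0 * a                       # dQ/da for Q(a) = -||a||^2
times = jnp.linspace(0.0, 1.0, 6)
acts = [jnp.ones(3) * float(t) for t in times]    # fake denoising trajectory
adjs = lean_adjoint_solve(base_v, q_grad(acts[-1]), acts, times)
```

The key property, per the abstract, is that gradients of the matching loss with respect to the policy touch each denoising step independently, which is what avoids the unstable backpropagation through the full sampling chain.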