Off-policy reinforcement learning with pretrained flow policies remains challenging because the multi-step sampling process makes optimization unstable. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating the problem as a memoryless stochastic optimal control (SOC) problem with a learned critic. However, QAM inherits a fundamental fragility of critic-guided improvement: small critic errors are amplified when the critic is ill-conditioned, often leading to model collapse. This paper introduces Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning algorithm that adaptively controls the path-space KL divergence from the pretrained flow policy through projected dual descent. Specifically, we optimize the trust-region parameter $\lambda$ in the SOC dynamics and show theoretically that the path-space KL divergence can be expressed as a closed-form function of $\lambda$. As a result, our method can precisely control the deviation from the pretrained flow policy, achieving stable off-policy RL. In experiments on 50 OGBench tasks, TRQAM consistently outperforms prior methods in both offline RL and offline-to-online RL. In particular, TRQAM achieves an overall success rate of 68% in offline RL, a substantial improvement over the strongest baseline at 46%.
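To make the idea of tuning a trust-region parameter by projected dual updates concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not state the closed-form KL expression, so `kl_stand_in` is a hypothetical placeholder that simply decreases as $\lambda$ grows. The convention that a larger $\lambda$ means a tighter trust region, the KL budget, the step size, and the projection bounds are likewise assumptions made for illustration.

```python
def kl_stand_in(lam, scale=4.0):
    """Hypothetical stand-in for the closed-form path-space KL as a function of lam.

    Assumption: KL shrinks monotonically as lam grows. This is NOT the
    expression derived in the paper, which the abstract does not give.
    """
    return scale / (lam * lam)


def projected_dual_step(lam, kl_value, kl_budget, step_size,
                        lam_min=1e-3, lam_max=1e3):
    """One projected update of the trust-region parameter.

    lam increases when the KL exceeds the budget and decreases when the KL
    is below it; the result is then projected onto [lam_min, lam_max].
    """
    updated = lam + step_size * (kl_value - kl_budget)
    return min(max(updated, lam_min), lam_max)


def fit_lambda(kl_budget, lam_init=1.0, step_size=0.5, steps=200):
    """Iterate the projected update until lam settles near KL(lam) = kl_budget."""
    lam = lam_init
    for _ in range(steps):
        lam = projected_dual_step(lam, kl_stand_in(lam), kl_budget, step_size)
    return lam


if __name__ == "__main__":
    lam_star = fit_lambda(kl_budget=1.0)
    print(f"lambda = {lam_star:.4f}, KL = {kl_stand_in(lam_star):.4f}")
```

With the placeholder KL and a budget of 1.0, the iteration settles where the stand-in KL meets the budget. In an actual training loop the update would presumably be interleaved with critic and policy updates rather than run to convergence in isolation.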