Reward fine-tuning has become a common approach for aligning pretrained diffusion and flow models with human preferences in text-to-image generation. Among reward-gradient-based methods, Adjoint Matching (AM) provides a principled formulation by casting reward fine-tuning as a stochastic optimal control (SOC) problem. However, AM incurs a substantial computational cost: it requires (i) stochastic simulation of full generative trajectories under memoryless dynamics, which takes a large number of function evaluations, and (ii) backward ODE simulation of the adjoint state along each sampled trajectory. In this work, we observe that both bottlenecks are closely tied to the \textit{non-trivial base drift} inherited from the pretrained model. Motivated by this observation, we propose \textbf{Efficient Adjoint Matching (EAM)}, which substantially improves training efficiency by reformulating the SOC problem with a \textit{linear base drift} and a correspondingly modified \textit{terminal cost}. This reformulation removes both sources of inefficiency: it enables training-time sampling with a few-step deterministic ODE solver, and it yields a closed-form adjoint solution that eliminates backward adjoint simulation. On standard text-to-image reward fine-tuning benchmarks, EAM converges up to 4x faster than AM and matches or surpasses it across various metrics, including PickScore, ImageReward, HPSv2.1, CLIPScore, and Aesthetics.