This paper focuses on aligning flow matching models with human preferences. A promising approach is to fine-tune the model by backpropagating reward gradients directly through the differentiable generation process of flow matching. However, backpropagating through long trajectories incurs prohibitive memory costs and causes gradient explosion. As a result, direct-gradient methods struggle to update early generation steps, which are crucial for determining the global structure of the final image. To address this issue, we introduce LeapAlign, a fine-tuning method that reduces computational cost and enables direct gradient propagation from the reward to early generation steps. Specifically, we shorten the long trajectory to only two steps by designing two consecutive leaps, each of which skips multiple ODE sampling steps and predicts a future latent in a single step. By randomizing the start and end timesteps of the leaps, LeapAlign enables efficient and stable model updates at any generation step. To make better use of these shortened trajectories, we assign higher training weights to those that are more consistent with the long generation path. To further enhance gradient stability, we reduce the weights of gradient terms with large magnitude instead of removing them completely, as is done in previous work. When fine-tuning the Flux model, LeapAlign consistently outperforms state-of-the-art GRPO-based and direct-gradient methods across various metrics, achieving superior image quality and image-text alignment.
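The two-leap idea can be illustrated with a minimal scalar sketch. It is not the LeapAlign implementation: the function names (`leap`, `two_leaps`, `consistency_weight`), the toy velocity field, the time convention (t = 1 for noise, t = 0 for the image), the assumption that a leap is a single Euler-style step, and the exponential form of the consistency weight are all assumptions made for illustration. The abstract only states that each leap predicts a future latent in one step, that the leap timesteps are randomized, and that shortened trajectories closer to the long path receive higher weight.

```python
import math
import random


def euler_trajectory(velocity, x, timesteps):
    """Long sampling path: one Euler step per adjacent pair of timesteps."""
    latents = [x]
    for t, s in zip(timesteps, timesteps[1:]):
        x = x + (s - t) * velocity(x, t)
        latents.append(x)
    return latents


def leap(velocity, x, t, s):
    """Jump from time t straight to time s using one velocity evaluation."""
    return x + (s - t) * velocity(x, t)


def two_leaps(velocity, x_start, timesteps, i, j, k):
    """Two consecutive leaps: step i -> step j, then step j -> step k."""
    x_mid = leap(velocity, x_start, timesteps[i], timesteps[j])
    x_end = leap(velocity, x_mid, timesteps[j], timesteps[k])
    return x_mid, x_end


def consistency_weight(x_leap, x_full, scale=1.0):
    """Hypothetical weight in (0, 1]: closer to the long path gives a larger weight."""
    return math.exp(-abs(x_leap - x_full) / scale)


def toy_velocity(x, t):
    """Hypothetical stand-in for the learned velocity network."""
    return x


# Eleven timesteps from 1.0 (noise) down to 0.0 (image); assumed convention.
timesteps = [1.0 - n / 10 for n in range(11)]
latents = euler_trajectory(toy_velocity, 1.0, timesteps)

# Randomize where the leaps start and end (three distinct, sorted indices).
random.seed(0)
i, j, k = sorted(random.sample(range(len(timesteps)), 3))

x_mid, x_end = two_leaps(toy_velocity, latents[i], timesteps, i, j, k)
weight = consistency_weight(x_end, latents[k])
print(f"leaps {i} -> {j} -> {k}, consistency weight = {weight:.4f}")
```

In an actual fine-tuning loop, the reward gradient would flow through only the two velocity evaluations inside `two_leaps`, rather than through every step of `euler_trajectory`, which is where the memory saving described above would come from.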