Diffusion and flow models achieve state-of-the-art (SOTA) generative performance, yet many practically important behaviors, such as fine-grained prompt fidelity, compositional correctness, and text rendering, are only weakly specified by score- or flow-matching pretraining objectives. Reinforcement learning (RL) fine-tuning with external, black-box rewards is a natural remedy, but diffusion RL is often brittle: trajectory-based methods incur high memory cost and high-variance gradient estimates, while forward-process approaches converge faster but can suffer from distribution drift and, hence, reward hacking. In this work, we present \textbf{Centered Reward Distillation (CRD)}, a diffusion RL framework derived from KL-regularized reward maximization and built on forward-process fine-tuning. The key insight is that the intractable normalizing constant cancels under \emph{within-prompt centering}, yielding a well-posed reward-matching objective. To enable reliable text-to-image fine-tuning, we introduce techniques that explicitly control distribution drift: (\textit{i}) decoupling the sampler from the moving reference to prevent ratio-signal collapse, (\textit{ii}) KL anchoring to a CFG-guided pretrained model to control long-run drift and align with the pretrained model's inference-time semantics, and (\textit{iii}) reward-adaptive KL strength to accelerate early learning under strong KL regularization while reducing late-stage exploitation of reward-model loopholes. Experiments on text-to-image post-training with \texttt{GenEval} and \texttt{OCR} rewards show that CRD achieves competitive, state-of-the-art reward-optimization results with fast convergence and reduced reward hacking, as validated on unseen preference metrics.
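To make the centering argument concrete, the following is a minimal sketch under assumed notation (reference model $p_{\mathrm{ref}}$, reward $r$, KL weight $\beta$, and $K$ samples per prompt $c$; the paper's exact parameterization may differ). The KL-regularized optimum satisfies $p^{*}(x \mid c) \propto p_{\mathrm{ref}}(x \mid c)\,\exp\!\big(r(x,c)/\beta\big)$, so its log-ratio to the reference equals the reward up to the intractable constant $\beta \log Z(c)$; because that constant is shared by all samples drawn for the same prompt, subtracting the within-prompt mean removes it:
\begin{align}
  \beta \log \frac{p^{*}(x_i \mid c)}{p_{\mathrm{ref}}(x_i \mid c)}
    &= r(x_i, c) - \beta \log Z(c),
    \qquad
    Z(c) = \mathbb{E}_{x \sim p_{\mathrm{ref}}(\cdot \mid c)}\!\left[ e^{r(x,c)/\beta} \right], \\
  \beta \log \frac{p^{*}(x_i \mid c)}{p_{\mathrm{ref}}(x_i \mid c)}
    - \frac{\beta}{K} \sum_{j=1}^{K} \log \frac{p^{*}(x_j \mid c)}{p_{\mathrm{ref}}(x_j \mid c)}
    &= r(x_i, c) - \frac{1}{K} \sum_{j=1}^{K} r(x_j, c).
\end{align}
Matching the centered log-ratio on the left, parameterized by the fine-tuned model, to the centered rewards on the right then gives a regression objective that never requires estimating $Z(c)$.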