We present Direct Reward Fine-Tuning (DRaFT), a simple and effective method for fine-tuning diffusion models to maximize differentiable reward functions, such as scores from human preference models. We first show that it is possible to backpropagate the reward function gradient through the full sampling procedure, and that doing so achieves strong performance on a variety of rewards, outperforming reinforcement learning-based approaches. We then propose more efficient variants of DRaFT: DRaFT-K, which truncates backpropagation to only the last K steps of sampling, and DRaFT-LV, which obtains lower-variance gradient estimates for the case when K=1. We show that our methods work well for a variety of reward functions and can be used to substantially improve the aesthetic quality of images generated by Stable Diffusion 1.4. Finally, we draw connections between our approach and prior work, providing a unifying perspective on the design space of gradient-based fine-tuning algorithms.
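To make the truncated-backpropagation idea behind DRaFT-K concrete, here is a minimal sketch: run most of the sampling chain without tracking gradients, then keep the computation graph only for the last K denoising steps, so the reward gradient reaches the fine-tuned parameters. The names below (`unet`, `ddim_step`, `reward_fn`) are hypothetical placeholders under assumed signatures, not the paper's actual implementation.

```python
# Minimal sketch of DRaFT-K truncated backprop. `unet(x, t, prompt_emb)` is
# an assumed noise-prediction network, `ddim_step(x, eps, t)` an assumed
# DDIM-style update, and `reward_fn(x, prompt_emb)` an assumed differentiable
# reward model (e.g., an aesthetic or preference scorer).
import torch

def draft_k_loss(unet, ddim_step, reward_fn, prompt_emb, timesteps, K):
    """Sample with gradients truncated to the last K denoising steps."""
    x = torch.randn(1, 4, 64, 64, device=prompt_emb.device)  # initial latent
    T = len(timesteps)

    # First T - K steps: no gradient, so no activations are stored.
    with torch.no_grad():
        for t in timesteps[: T - K]:
            eps = unet(x, t, prompt_emb)
            x = ddim_step(x, eps, t)

    # Last K steps: keep the graph so the reward gradient reaches the
    # fine-tuned parameters (e.g., LoRA weights inside `unet`).
    for t in timesteps[T - K :]:
        eps = unet(x, t, prompt_emb)
        x = ddim_step(x, eps, t)

    # Maximizing the reward corresponds to minimizing its negation.
    return -reward_fn(x, prompt_emb)
```

Setting K equal to the full number of sampling steps recovers backpropagation through the entire chain, while K = 1 is the regime where the lower-variance DRaFT-LV estimator applies.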