We present Direct Reward Fine-Tuning (DRaFT), a simple and effective method for fine-tuning diffusion models to maximize differentiable reward functions, such as scores from human preference models. We first show that it is possible to backpropagate the reward function gradient through the full sampling procedure, and that doing so achieves strong performance on a variety of rewards, outperforming reinforcement learning-based approaches. We then propose more efficient variants of DRaFT: DRaFT-K, which truncates backpropagation to only the last K steps of sampling, and DRaFT-LV, which obtains lower-variance gradient estimates for the case when K=1. We show that our methods work well for a variety of reward functions and can be used to substantially improve the aesthetic quality of images generated by Stable Diffusion 1.4. Finally, we draw connections between our approach and prior work, providing a unifying perspective on the design space of gradient-based fine-tuning algorithms.
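The truncation idea behind DRaFT-K can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes a hypothetical one-parameter "denoiser" `eps = theta * x`, a deterministic scalar sampler, and a quadratic reward, and it tracks the derivative of the sample with respect to `theta` by hand rather than using reverse-mode autodiff through a real diffusion model. All function names and constants are illustrative. Steps earlier than the last `k` are treated as constants, which plays the role of a stop-gradient; setting `k` equal to the number of sampling steps corresponds to differentiating through the full sampling chain.

```python
def sample_with_grad(theta, x_T, num_steps, k, step=0.1):
    """Run a toy sampler; return (x_0, d x_0 / d theta) with the gradient
    truncated to the last k sampling steps."""
    x, dx = x_T, 0.0
    for t in range(num_steps, 0, -1):
        if t <= k:
            # Differentiate the update x <- x - step * theta * x w.r.t. theta,
            # using the value of x from before this step's update.
            dx = dx * (1.0 - step * theta) - step * x
        # For t > k the step is treated as a constant, so dx stays at 0.
        x = x - step * theta * x
    return x, dx


def reward(x0, target=0.5):
    """Hypothetical differentiable reward: highest when x0 hits the target."""
    return -(x0 - target) ** 2


def reward_grad(x0, target=0.5):
    """Derivative of the toy reward with respect to x0."""
    return -2.0 * (x0 - target)


def finetune(theta, k, num_steps=10, lr=0.5, iters=500, x_T=1.0):
    """Gradient ascent on the reward using the k-step truncated gradient."""
    for _ in range(iters):
        x0, dx0 = sample_with_grad(theta, x_T, num_steps, k)
        theta += lr * reward_grad(x0) * dx0  # chain rule: dr/dx0 * dx0/dtheta
    return theta


if __name__ == "__main__":
    for k in (1, 10):
        theta = finetune(0.0, k)
        x0, _ = sample_with_grad(theta, 1.0, 10, k)
        print(f"k={k}: theta={theta:.4f}, x0={x0:.4f}, reward={reward(x0):.6f}")
```

In this toy setting the truncated gradient points in the same direction as the full one but is smaller in magnitude, so both settings of `k` move the sample toward the target; this property is specific to the example and should not be read as a claim about real diffusion models.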