Text-to-image diffusion models have recently emerged at the forefront of image generation, powered by very large-scale unsupervised or weakly supervised text-to-image training datasets. Because their training is unsupervised, controlling their behavior in downstream tasks, such as maximizing human-perceived image quality, image-text alignment, or ethical image generation, is difficult. Recent works finetune diffusion models to downstream reward functions using vanilla reinforcement learning, which is notorious for the high variance of its gradient estimators. In this paper, we propose AlignProp, a method that aligns diffusion models to downstream reward functions using end-to-end backpropagation of the reward gradient through the denoising process. While a naive implementation of such backpropagation would require prohibitive memory resources for storing the partial derivatives of modern text-to-image models, AlignProp finetunes low-rank adapter weight modules and uses gradient checkpointing to render its memory usage viable. We test AlignProp in finetuning diffusion models to various objectives, such as image-text semantic alignment, aesthetics, compressibility, and controllability of the number of objects present, as well as their combinations. We show that AlignProp achieves higher rewards in fewer training steps than alternatives while being conceptually simpler, making it a straightforward choice for optimizing diffusion models for differentiable reward functions of interest. Code and visualization results are available at https://align-prop.github.io/.
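To make the idea of backpropagating a reward gradient through the denoising process concrete, the following is a minimal scalar sketch, not the AlignProp implementation. It assumes a toy denoiser eps(x) = (w0 + a*b) * x, where w0 stands in for a frozen pretrained weight and the pair (a, b) for a rank-1 adapter, and a toy differentiable reward R(x0) = -(x0 - target)^2. All function names, the update rule, and the reward are hypothetical and chosen only for illustration. The backward pass stores only every `every`-th intermediate state and recomputes the rest, which is the trade made by gradient checkpointing.

```python
def denoise_step(x, w0, a, b, step):
    # Toy denoiser (assumption): eps(x) = (w0 + a*b) * x.
    # w0 is frozen; only the rank-1 adapter factors a and b are trained.
    return x - step * (w0 + a * b) * x


def rollout(x_T, w0, a, b, step, T, every):
    # Run T denoising steps, storing only every `every`-th state.
    ckpts = {0: x_T}  # key = number of steps applied so far
    x = x_T
    for k in range(1, T + 1):
        x = denoise_step(x, w0, a, b, step)
        if k % every == 0 and k < T:
            ckpts[k] = x
    return x, ckpts


def reward(x0, target):
    # Toy differentiable reward (assumption): negative squared error.
    return -(x0 - target) ** 2


def reward_and_grads(x_T, w0, a, b, step, T, every, target):
    # Forward pass keeps only the checkpointed states.
    x0, ckpts = rollout(x_T, w0, a, b, step, T, every)
    g_x = -2.0 * (x0 - target)  # dR/dx at the final sample
    g_a = 0.0
    g_b = 0.0
    end = T
    # Walk the segments from last to first, recomputing the states
    # inside each segment from its checkpoint before backpropagating.
    for start in sorted(ckpts, reverse=True):
        xs = [ckpts[start]]  # xs[i] is the input to step start + i
        for _ in range(start, end - 1):
            xs.append(denoise_step(xs[-1], w0, a, b, step))
        for x_in in reversed(xs):
            # x_out = x_in - step * (w0 + a*b) * x_in
            g_a += g_x * (-step * b * x_in)
            g_b += g_x * (-step * a * x_in)
            g_x *= 1.0 - step * (w0 + a * b)
        end = start
    return reward(x0, target), g_a, g_b
```

A gradient-ascent update on the adapter would then read `a += lr * g_a; b += lr * g_b`, leaving w0 untouched. In this sketch, `every=1` stores every state and `every=T` stores only the initial noise, so the parameter trades memory against recomputation without changing the resulting gradients.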