Diffusion models and flow matching have demonstrated remarkable success in text-to-image generation. Most existing alignment methods fine-tune a pre-trained generative model to maximize a given reward function, which demands extensive computational resources and may not generalize well across objectives. In this work, we propose an alignment framework that leverages the underlying nature of the alignment problem: sampling from a reward-weighted distribution. We show that this view applies to both diffusion models (via score guidance) and flow matching models (via velocity guidance): the score function (velocity field) of the reward-weighted distribution decomposes into the pre-trained score (velocity field) plus a guidance term given by a conditional expectation of the reward. For diffusion models, we identify a fundamental challenge: the adversarial nature of the guidance term can introduce undesirable artifacts into the generated images. To address this, we propose a framework that requires no fine-tuning and instead trains a guidance network to estimate the conditional expectation of the reward. With one-step generation, it matches the performance of fine-tuning-based models while reducing computational cost by at least 60%. For flow matching models, we propose a training-free framework that improves generation quality at no additional computational cost.
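To make the decomposition concrete, here is one standard form it can take. This is a sketch under assumptions the abstract does not state: a reward-tilted target $p^*_0(x_0) \propto p_0(x_0)\,e^{r(x_0)/\beta}$ with reward $r$ and temperature $\beta$, and a Gaussian forward process $x_t = \alpha_t x_0 + \sigma_t \epsilon$; the symbols $r$, $\beta$, $\alpha_t$, $\sigma_t$ are ours, not the paper's. Under these assumptions the tilted marginals satisfy

$$
p_t^*(x_t) \;\propto\; p_t(x_t)\,\mathbb{E}\!\left[e^{r(x_0)/\beta} \,\middle|\, x_t\right],
\qquad
\nabla_{x_t}\log p_t^*(x_t)
= \underbrace{\nabla_{x_t}\log p_t(x_t)}_{\text{pre-trained score}}
+ \underbrace{\nabla_{x_t}\log \mathbb{E}\!\left[e^{r(x_0)/\beta} \,\middle|\, x_t\right]}_{\text{reward guidance}} .
$$

For the flow matching counterpart, the standard velocity-score identity for Gaussian paths suggests the velocity field would pick up the same gradient, scaled by a path-dependent coefficient:

$$
u_t^*(x) = u_t(x) + \Big(\dot\sigma_t\sigma_t - \tfrac{\dot\alpha_t}{\alpha_t}\sigma_t^2\Big)\,
\nabla_x \log \mathbb{E}\!\left[e^{r(x_0)/\beta} \,\middle|\, x_t = x\right].
$$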
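As one concrete, hypothetical instantiation of the guidance-network idea, the PyTorch sketch below regresses a network $g_\theta(x_t, t)$ onto the exponentiated reward along the forward process, then adds $\nabla_{x_t}\log g_\theta$ to the pre-trained score at sampling time. The architecture, the Gaussian noising, and all names (`GuidanceNet`, `reward_fn`, `score_fn`, `beta`) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class GuidanceNet(nn.Module):
    """Hypothetical network estimating E[exp(r(x_0)/beta) | x_t] for flat inputs."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1), nn.Softplus(),  # keep the estimate positive
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition on the scalar time by concatenating it to the sample.
        return self.net(torch.cat([x_t, t[:, None]], dim=-1)).squeeze(-1)

def guidance_loss(g_net, x0, t, reward_fn, alpha_t, sigma_t, beta=1.0):
    """Regress g(x_t, t) onto exp(r(x_0)/beta), with x_t = alpha_t*x0 + sigma_t*eps
    (an assumed Gaussian forward process); the base model is never fine-tuned."""
    eps = torch.randn_like(x0)
    x_t = alpha_t[:, None] * x0 + sigma_t[:, None] * eps
    target = torch.exp(reward_fn(x0) / beta)
    return ((g_net(x_t, t) - target) ** 2).mean()

def guided_score(score_fn, g_net, x_t, t):
    """Pre-trained score plus the learned guidance gradient, via autograd."""
    x_t = x_t.detach().requires_grad_(True)
    log_g = torch.log(g_net(x_t, t) + 1e-8).sum()
    grad_log_g = torch.autograd.grad(log_g, x_t)[0]
    return score_fn(x_t, t) + grad_log_g
```

In use, `score_fn` would be the pre-trained diffusion score (e.g., derived from a noise-prediction network) and `guided_score` would replace the plain score inside an existing sampler; for the flow matching variant, the same gradient would be scaled by the path coefficient shown above and added to the velocity field instead.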