Denoising diffusion models excel at generating high-quality images conditioned on text prompts, yet their effectiveness heavily relies on careful guidance during the sampling process. Classifier-Free Guidance (CFG) provides a widely used mechanism for steering generation by setting the guidance scale, which balances image quality and prompt alignment. However, the choice of the guidance scale has a critical impact on the convergence toward a visually appealing and prompt-adherent image. In this work, we propose an annealing guidance scheduler which dynamically adjusts the guidance scale over time based on the conditional noisy signal. By learning a scheduling policy, our method addresses the temperamental behavior of CFG. Empirical results demonstrate that our guidance scheduler significantly enhances image quality and alignment with the text prompt, advancing the performance of text-to-image generation. Notably, our novel scheduler requires no additional activations or memory consumption, and can seamlessly replace the common classifier-free guidance, offering an improved trade-off between prompt alignment and quality.
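To make the role of the guidance scale concrete, the following is a minimal sketch of the standard classifier-free guidance combination with a scale that varies across sampling steps. The names `cfg_combine` and `linear_schedule` are hypothetical and chosen only for illustration. The hand-written linear schedule is a placeholder assumption: it is not the learned scheduler proposed in this work, which adjusts the scale based on the conditional noisy signal and whose form is not specified in the abstract.

```python
from typing import List


def cfg_combine(eps_uncond: List[float], eps_cond: List[float], scale: float) -> List[float]:
    """Standard classifier-free guidance combination.

    Starts from the unconditional prediction and moves toward (or past)
    the conditional prediction by a factor of `scale`.
    scale = 0 gives the unconditional prediction, scale = 1 the conditional one.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def linear_schedule(step: int, num_steps: int,
                    w_start: float = 8.0, w_end: float = 2.0) -> float:
    """Hypothetical placeholder schedule (an assumption, not the paper's method).

    Linearly interpolates the guidance scale from w_start at the first
    step to w_end at the last step.
    """
    frac = step / max(num_steps - 1, 1)
    return w_start + frac * (w_end - w_start)


if __name__ == "__main__":
    # Toy noise predictions standing in for the outputs of a denoising network.
    eps_uncond = [0.0, 1.0]
    eps_cond = [1.0, 3.0]
    num_steps = 5
    for step in range(num_steps):
        w = linear_schedule(step, num_steps)
        guided = cfg_combine(eps_uncond, eps_cond, w)
        print(f"step={step} scale={w:.2f} guided={guided}")
```

With a constant schedule this reduces to ordinary CFG with a fixed guidance scale. A scheduler of the kind described in the abstract would replace `linear_schedule` with a learned policy while leaving the combination step unchanged.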