Diffusion and flow matching models have achieved remarkable success in text-to-image generation. However, these models typically rely on a predetermined denoising schedule that is fixed across all prompts. The multi-step reverse diffusion process can be regarded as a kind of chain-of-thought that generates a high-quality image step by step. A diffusion model should therefore reason about each instance and adaptively determine an optimal noise schedule, achieving high generation quality with efficient sampling. To this end, we introduce the Time Prediction Diffusion Model (TPDM). TPDM employs a plug-and-play Time Prediction Module (TPM) that, at each denoising step, predicts the next noise level from the current latent features. We train the TPM with reinforcement learning, maximizing a reward that encourages high final image quality while penalizing excessive denoising steps. With this adaptive scheduler, TPDM not only generates high-quality images that align closely with human preferences but also adjusts the diffusion time and the number of denoising steps on the fly, improving both performance and efficiency. Built on the Stable Diffusion 3 Medium architecture, TPDM achieves an aesthetic score of 5.44 and a Human Preference Score (HPS) of 29.59 while achieving better performance with around 50% fewer denoising steps.
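To make the mechanism concrete, below is a minimal sketch of the instance-adaptive sampling loop, under several stated assumptions: the names `TimePredictionModule` and `adaptive_sample` are hypothetical, the backbone is assumed to return intermediate features alongside its velocity prediction, prompt conditioning is omitted, and the TPM is parameterized here as a simple MLP that predicts a decay ratio in (0, 1) so the next time is always smaller than the current one. The actual TPDM design may differ.

```python
import torch
import torch.nn as nn

class TimePredictionModule(nn.Module):
    """Hypothetical plug-and-play TPM: maps pooled latent features and the
    current noise level t to the next noise level t_next < t."""
    def __init__(self, feat_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 1, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, feats: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        pooled = feats.flatten(2).mean(dim=2)           # (B, feat_dim)
        x = torch.cat([pooled, t.unsqueeze(1)], dim=1)  # (B, feat_dim + 1)
        ratio = torch.sigmoid(self.mlp(x)).squeeze(1)   # decay ratio in (0, 1)
        return ratio * t                                # guarantees t_next < t

@torch.no_grad()
def adaptive_sample(model, tpm, latents, t_start=1.0, t_min=1e-3, max_steps=28):
    """Denoise with an instance-adaptive schedule instead of a fixed one.
    Assumes `model(latents, t)` returns (velocity, features)."""
    t = torch.full((latents.shape[0],), t_start, device=latents.device)
    for _ in range(max_steps):
        velocity, feats = model(latents, t)
        t_next = tpm(feats, t).clamp(min=t_min)
        # First-order (Euler) update of the flow-matching ODE from t to t_next.
        latents = latents + (t_next - t).view(-1, 1, 1, 1) * velocity
        t = t_next
        if (t <= t_min).all():  # stop early once the schedule terminates
            break
    return latents
```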
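The training signal can be sketched in the same spirit. Assuming the TPM acts as a stochastic policy during training (sampling each decay ratio from a distribution such as a Beta and recording its log-probability), a simple REINFORCE-style surrogate loss over one denoising trajectory might look like the following; the `step_penalty` weight and the baseline are illustrative values, not the paper's.

```python
import torch

def trajectory_loss(log_probs: list[torch.Tensor],
                    quality_score: torch.Tensor,
                    num_steps: int,
                    step_penalty: float = 0.05,
                    baseline: float = 0.0) -> torch.Tensor:
    """REINFORCE-style surrogate loss for one denoising trajectory.

    `log_probs` holds the log-probability of each sampled decay ratio along
    the trajectory. The reward rewards final image quality (e.g., a human
    preference score) and penalizes the number of denoising steps taken.
    """
    reward = quality_score - step_penalty * num_steps
    advantage = reward - baseline                       # simple variance reduction
    return -(advantage * torch.stack(log_probs).sum())  # minimize => ascend reward
```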