Recently, latent Diffusion Probabilistic Models (DPMs) such as Stable Diffusion have been applied to high-quality Text-to-Image (T2I) generation by injecting the encoded target text prompt into the diffusion image generator, which denoises the image step by step. Despite the practical success of DPMs, the mechanism behind them remains to be explored. To fill this gap, we begin by examining the intermediate states of the gradual denoising generation process. Our empirical observations indicate that the overall shape of the image is reconstructed within the first few denoising steps, after which the image is filled in with details (e.g., texture). This happens because the low-frequency (shape-relevant) signal of the noisy image is not corrupted until the final stage of the noise-adding forward process, which corresponds to the initial stage of generation. Inspired by these observations, we proceed to explore the influence of each token in the text prompt during the two stages. After a series of T2I generation experiments conditioned on a set of text prompts, we conclude that in the earlier generation stage the image is mostly determined by the special token [\texttt{EOS}] in the text prompt, and that the information in the text prompt is already conveyed during this stage. Afterwards, the diffusion model completes the details of the generated image using information from the image itself. Finally, we apply this observation to accelerate T2I generation by properly removing text guidance, which speeds up sampling by 25\%+.
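To make the acceleration idea concrete, here is a minimal sketch of a sampling loop that stops using text guidance after an early portion of the denoising steps. It is an illustration under stated assumptions, not the authors' implementation: it assumes text conditioning is applied through classifier-free guidance (two network evaluations per step) and that guidance is dropped by falling back to the unconditional prediction. The names `toy_denoiser`, `sample`, and `guided_fraction`, as well as the simplified update rule, are hypothetical.

```python
import numpy as np


def toy_denoiser(x, t, cond=None):
    """Hypothetical stand-in for a noise-prediction network (not a real model)."""
    shift = 0.0 if cond is None else float(np.mean(cond))
    return 0.1 * x + 0.01 * shift


def sample(num_steps=50, guided_fraction=0.5, guidance_scale=7.5, seed=0):
    """Run a toy denoising loop and count network evaluations.

    Text guidance is used only for the first `guided_fraction` of the steps
    (an assumed schedule); the remaining steps use the unconditional
    prediction alone, which saves one evaluation per step.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((4, 4))   # toy "latent image"
    cond = rng.standard_normal(8)     # toy "text embedding"
    guided_steps = int(num_steps * guided_fraction)
    calls = 0
    for step in range(num_steps):
        eps_uncond = toy_denoiser(x, step)
        calls += 1
        if step < guided_steps:
            # Early stage: combine conditional and unconditional predictions.
            eps_cond = toy_denoiser(x, step, cond)
            calls += 1
            eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        else:
            # Later stage: text guidance removed, one evaluation per step.
            eps = eps_uncond
        x = x - eps  # simplified update, not a real DPM sampler step
    return x, calls


if __name__ == "__main__":
    _, full_calls = sample(guided_fraction=1.0)
    _, early_only_calls = sample(guided_fraction=0.5)
    print(full_calls, early_only_calls)
```

In this toy setting, the saving comes purely from skipping the conditional evaluation in later steps. The evaluation counts it prints describe the toy loop only and are not a reproduction of the reported speed-up.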