Towards Understanding the Working Mechanism of Text-to-Image Diffusion Model

Recently, the strong latent Diffusion Probabilistic Model (DPM) has been applied to high-quality Text-to-Image (T2I) generation (e.g., Stable Diffusion) by injecting the encoded target text prompt into the gradually denoised diffusion image generator. Despite the success of DPM in practice, the mechanism behind it remains to be explored. To fill this gap, we begin by examining the intermediate states during the gradual denoising generation process in DPM. The empirical observations indicate that the shape of the image is reconstructed after the first few denoising steps, and the image is then filled in with details (e.g., texture). This phenomenon arises because the low-frequency (shape-relevant) signal of the noisy image is not corrupted until the final stage of the noise-adding forward process in DPM, which corresponds to the initial stage of generation. Inspired by these observations, we proceed to explore the influence of each token in the text prompt during the two stages. After a series of experiments on T2I generation conditioned on a set of text prompts, we conclude that in the earlier generation stage the image is mostly decided by the special token [EOS] in the text prompt, and that the information in the text prompt is already conveyed in this stage. After that, the diffusion model completes the details of the generated images using information from the images themselves. Finally, we propose to apply this observation to accelerate T2I generation by properly removing text guidance, which accelerates sampling by up to 25%+.
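The acceleration idea at the end of the abstract can be illustrated with a minimal sketch. It assumes classifier-free guidance, where every guided step needs two denoiser evaluations (one with the prompt, one without) and an unguided step needs only one; the abstract does not state this setup, so treat it as an assumption. The names `sample_with_guidance_cutoff`, `toy_denoiser`, `guided_steps`, and the update rule are hypothetical placeholders, not the authors' implementation or any library's API.

```python
from typing import Callable, Optional, Tuple

# Hypothetical denoiser signature: (sample, timestep, prompt or None) -> predicted noise.
Denoiser = Callable[[float, int, Optional[str]], float]


def toy_denoiser(x: float, t: int, prompt: Optional[str]) -> float:
    """Stand-in for a real noise-prediction network (assumption, scalar toy)."""
    target = 1.0 if prompt is not None else 0.0
    return x - target


def sample_with_guidance_cutoff(
    denoiser: Denoiser,
    x_init: float,
    num_steps: int,
    guided_steps: int,
    prompt: str,
    guidance_scale: float = 7.5,
    step_size: float = 0.1,
) -> Tuple[float, int]:
    """Run a toy sampling loop that uses text guidance only for the
    first `guided_steps` steps, and count denoiser evaluations."""
    if not 0 <= guided_steps <= num_steps:
        raise ValueError("guided_steps must lie in [0, num_steps]")

    x = x_init
    calls = 0
    for i in range(num_steps):
        t = num_steps - i  # timesteps run from noisy (large t) to clean (small t)

        # The unconditional prediction is needed at every step.
        eps = denoiser(x, t, None)
        calls += 1

        if i < guided_steps:
            # Early stage: also evaluate the text-conditioned branch and
            # combine the two with the classifier-free guidance formula.
            eps_cond = denoiser(x, t, prompt)
            calls += 1
            eps = eps + guidance_scale * (eps_cond - eps)

        # Placeholder update rule, not a real diffusion scheduler.
        x = x - step_size * eps
    return x, calls


if __name__ == "__main__":
    _, full = sample_with_guidance_cutoff(toy_denoiser, 0.0, 50, 50, "a cat")
    _, cut = sample_with_guidance_cutoff(toy_denoiser, 0.0, 50, 25, "a cat")
    print(full, cut, 1 - cut / full)  # 100 75 0.25
```

In this toy setting, dropping guidance after half of the steps removes a quarter of the denoiser evaluations. That number is only arithmetic under the stated assumption; it is not a reproduction of the reported speedup, which depends on where the guidance is actually removed and on the real model's cost profile.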

