This work aims to improve the efficiency of text-to-image diffusion models. While diffusion models use computationally expensive UNet-based denoising operations in every generation step, we identify that not all operations are equally relevant to the final output quality. In particular, we observe that UNet layers operating on high-resolution feature maps are relatively sensitive to small perturbations. In contrast, low-resolution feature maps influence the semantic layout of the final image and can often be perturbed with no noticeable change in the output. Based on this observation, we propose Clockwork Diffusion, a method that periodically reuses computation from preceding denoising steps to approximate low-resolution feature maps at one or more subsequent steps. For multiple baselines, and for both text-to-image generation and image editing, we demonstrate that Clockwork leads to comparable or improved perceptual scores with drastically reduced computational complexity. As an example, for Stable Diffusion v1.5 with 8 DPM++ steps, we save 32% of FLOPs with negligible change in FID and CLIP scores.
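The periodic reuse described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: scalars stand in for feature maps, the three callables (`high_res_in`, `low_res`, `high_res_out`) are hypothetical placeholders for the corresponding parts of a UNet, and the cached low-resolution features are reused unchanged, which is the simplest possible approximation. The `clock` period and the `low_res_fraction` cost share are likewise illustrative assumptions, not values taken from this work.

```python
from typing import Callable, Optional, Tuple


def clockwork_denoise(
    x: float,
    num_steps: int,
    clock: int,
    high_res_in: Callable[[float, int], float],
    low_res: Callable[[float, int], float],
    high_res_out: Callable[[float, float, int], float],
) -> Tuple[float, int]:
    """Toy denoising loop that recomputes the low-res path every `clock` steps.

    All three callables are hypothetical stand-ins for UNet sub-networks.
    Returns the final sample and the number of full low-res evaluations.
    """
    if clock < 1:
        raise ValueError("clock must be >= 1")
    cached_low: Optional[float] = None
    low_res_calls = 0
    for step in range(num_steps):
        # High-resolution layers are sensitive to perturbation: always run them.
        h = high_res_in(x, step)
        if cached_low is None or step % clock == 0:
            # "Clock" step: run the expensive low-resolution path and cache it.
            cached_low = low_res(h, step)
            low_res_calls += 1
        # Other steps reuse the cached (stale) low-res features as an approximation.
        x = high_res_out(h, cached_low, step)
    return x, low_res_calls


def relative_cost(num_steps: int, clock: int, low_res_fraction: float) -> float:
    """Cost relative to running the full network at every step.

    `low_res_fraction` is an assumed share of per-step cost spent in the
    low-res path; the real share depends on the architecture.
    """
    full_steps = (num_steps + clock - 1) // clock  # steps with step % clock == 0
    cheap_steps = num_steps - full_steps
    return (full_steps + cheap_steps * (1.0 - low_res_fraction)) / num_steps


# Usage with toy sub-networks (assumptions for illustration only).
sample, calls = clockwork_denoise(
    x=1.0,
    num_steps=8,
    clock=2,
    high_res_in=lambda x, t: 0.9 * x,
    low_res=lambda h, t: 0.1 * h,
    high_res_out=lambda h, low, t: h - low,
)
```

With `clock=1` the loop reduces to ordinary denoising, so the period gives a single knob for trading compute against fidelity. The saving reported above depends on the actual architecture and schedule, and this toy cost model does not attempt to reproduce it.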