This work aims to improve the efficiency of text-to-image diffusion models. While diffusion models use computationally expensive UNet-based denoising operations in every generation step, we identify that not all operations are equally relevant for the final output quality. In particular, we observe that UNet layers operating on high-res feature maps are relatively sensitive to small perturbations. In contrast, low-res feature maps influence the semantic layout of the final image and can often be perturbed with no noticeable change in the output. Based on this observation, we propose Clockwork Diffusion, a method that periodically reuses computation from preceding denoising steps to approximate low-res feature maps at one or more subsequent steps. For multiple baselines, and for both text-to-image generation and image editing, we demonstrate that Clockwork leads to comparable or improved perceptual scores with drastically reduced computational complexity. As an example, for Stable Diffusion v1.5 with 8 DPM++ steps we save 32% of FLOPs with negligible FID and CLIP change.
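The reuse schedule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the scalar "feature maps", the function names (`clockwork_denoise`, `toy_low_res`, `toy_high_res`, `flops_saved_fraction`), the fixed `clock` period, and the `low_res_share` cost parameter are all hypothetical, chosen only to show how low-res features might be recomputed every few steps and reused in between.

```python
from typing import Callable, Optional, Tuple


def toy_low_res(x: float) -> float:
    """Hypothetical stand-in for the expensive low-res part of the UNet."""
    return 0.5 * x


def toy_high_res(x: float, low: float) -> float:
    """Hypothetical stand-in for the high-res layers, which run at every step."""
    return x - 0.1 * (x + low)


def clockwork_denoise(
    x: float,
    num_steps: int,
    clock: int,
    high_res_fn: Callable[[float, float], float] = toy_high_res,
    low_res_fn: Callable[[float], float] = toy_low_res,
) -> Tuple[float, int]:
    """Run num_steps toy denoising steps.

    The low-res path is recomputed only when step % clock == 0; on all
    other steps the cached result from an earlier step is reused as an
    approximation. Returns the final sample and the number of low-res
    evaluations actually performed.
    """
    cached_low: Optional[float] = None
    low_res_calls = 0
    for step in range(num_steps):
        if cached_low is None or step % clock == 0:
            cached_low = low_res_fn(x)  # full computation on this step
            low_res_calls += 1
        # High-res layers always run, fed with fresh or cached low-res features.
        x = high_res_fn(x, cached_low)
    return x, low_res_calls


def flops_saved_fraction(num_steps: int, clock: int, low_res_share: float) -> float:
    """Fraction of total cost saved under a simple assumed cost model.

    low_res_share is the assumed fraction of per-step cost spent in the
    low-res path; it is an illustrative parameter, not a measured value.
    """
    full_steps = -(-num_steps // clock)  # ceil(num_steps / clock)
    return low_res_share * (1.0 - full_steps / num_steps)


if __name__ == "__main__":
    sample, calls = clockwork_denoise(1.0, num_steps=8, clock=2)
    print(f"low-res evaluations: {calls} of 8 steps")
    print(f"assumed savings: {flops_saved_fraction(8, 2, 0.6):.2f}")
```

With `clock=1` the sketch reduces to an ordinary sampler that recomputes everything at every step. Larger `clock` values trade approximation error in the low-res features for fewer evaluations. The savings figure printed here follows from the assumed cost model only and is not meant to reproduce the 32% reported above.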