Text-to-image diffusion models have demonstrated unprecedented capabilities for flexible and realistic image synthesis. Nevertheless, these models rely on a time-consuming sampling procedure, which has motivated attempts to reduce their latency. When improving efficiency, researchers often use the original diffusion model to train an additional network designed specifically for fast image generation. In contrast, our approach seeks to reduce latency directly, without any retraining, fine-tuning, or knowledge distillation. In particular, we find the repeated calculation of attention maps to be costly yet redundant, and instead suggest reusing them during sampling. Our specific reuse strategies are based on ODE theory, which implies that the later a map is reused, the smaller the distortion in the final image. We empirically compare these reuse strategies with few-step sampling procedures of comparable latency, finding that reuse generates images that are closer to those produced by the original high-latency diffusion model.
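The core idea of reusing attention maps during sampling can be illustrated with a toy sketch. The code below is not the paper's implementation; all names (`CachedAttention`, `sample`, `reuse_from`) are hypothetical, and the "model" is a random toy network. It only shows the mechanism: for early steps the attention map is computed and cached, and from a chosen step onward the cached map is reused so the softmax attention computation is skipped.

```python
import numpy as np

class CachedAttention:
    """Toy attention layer that can reuse a previously computed attention map.
    Illustrative only; the real method operates inside a diffusion U-Net."""

    def __init__(self):
        self.cached_map = None  # attention map saved from an earlier sampling step

    def __call__(self, q, k, v, reuse=False):
        if reuse and self.cached_map is not None:
            attn = self.cached_map  # reuse: skip the costly q @ k.T softmax
        else:
            scores = q @ k.T / np.sqrt(q.shape[-1])
            scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
            attn = np.exp(scores)
            attn /= attn.sum(axis=-1, keepdims=True)
            self.cached_map = attn  # cache for potential reuse later
        return attn @ v


def sample(attn_layer, steps=10, reuse_from=7, dim=4, seed=0):
    """Toy sampling loop: compute attention maps while t < reuse_from,
    then reuse the last cached map. Per the ODE-based analysis in the text,
    the later reuse begins, the smaller the distortion in the final image."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((3, dim))
    for t in range(steps):
        qkv = rng.standard_normal((3, 3, dim))  # stand-in for the model's q, k, v
        x = attn_layer(qkv[0], qkv[1], qkv[2], reuse=(t >= reuse_from))
    return x
```

In this sketch, setting `reuse_from = steps` recovers the original (full-computation) sampler, while smaller values trade fidelity for latency, mirroring the trade-off the abstract describes.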