Attention mechanisms have been crucial for image diffusion models; however, their quadratic computational complexity limits the image sizes that can be processed within reasonable time and memory constraints. This paper investigates the importance of dense attention in generative image models, whose features are often redundant, making them suitable for sparser attention mechanisms. We propose a novel training-free method, ToDo, that relies on token downsampling of key and value tokens to accelerate Stable Diffusion inference by up to 2x for common sizes and by up to 4.5x or more for high resolutions such as 2048x2048. We demonstrate that our approach outperforms previous methods in balancing throughput and fidelity.
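
To make the idea of downsampling only the key and value tokens concrete, the following is a minimal, dependency-free sketch rather than the method's actual implementation. It rests on several assumptions not stated in the abstract: the helper names (`softmax`, `downsample_tokens`, `attention`) are hypothetical, plain strided subsampling stands in for the downsampling operator, the learned query/key/value projections are omitted, and the 8x8 grid of 4-dimensional tokens is a toy input.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def downsample_tokens(tokens, height, width, factor):
    """Keep every `factor`-th token along both axes of a row-major grid.

    Assumption: plain strided subsampling stands in for whatever
    downsampling operator is actually used.
    """
    return [
        tokens[row * width + col]
        for row in range(0, height, factor)
        for col in range(0, width, factor)
    ]


def attention(queries, keys, values):
    """Scaled dot-product attention; returns one output vector per query."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


# Hypothetical toy input: an 8x8 grid of 4-dimensional tokens.
HEIGHT, WIDTH, DIM, FACTOR = 8, 8, 4, 2
rng = random.Random(0)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(HEIGHT * WIDTH)]

# Queries keep full resolution; keys and values share the reduced token set.
# Assumption: learned Q/K/V projections are omitted (treated as identity).
reduced = downsample_tokens(tokens, HEIGHT, WIDTH, FACTOR)
dense_out = attention(tokens, tokens, tokens)
sparse_out = attention(tokens, reduced, reduced)

dense_scores = len(tokens) * len(tokens)     # 64 * 64 = 4096 score entries
sparse_scores = len(tokens) * len(reduced)   # 64 * 16 = 1024 score entries
print(len(sparse_out), len(reduced), dense_scores // sparse_scores)  # 64 16 4
```

In this sketch the output still has one vector per query token, while the score matrix shrinks by the square of the downsampling factor. That ratio describes only the attention scores in this toy setting; it should not be read as the end-to-end speedup reported above, which depends on the rest of the network.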