Attention mechanisms have been crucial for image diffusion models; however, their quadratic computational complexity limits the size of the images we can process within reasonable time and memory constraints. This paper investigates the importance of dense attention in generative image models, which often contain redundant features, making them suitable for sparser attention mechanisms. We propose a novel training-free method, ToDo, that relies on downsampling the key and value tokens to accelerate Stable Diffusion inference by up to 2x for common sizes and up to 4.5x or more for high resolutions like 2048x2048. We demonstrate that our approach outperforms previous methods in balancing efficient throughput and fidelity.
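To make the idea concrete, the following is a minimal NumPy sketch of attention with downsampled key and value tokens. It is an illustration under stated assumptions, not the paper's implementation: the function names (`downsample_tokens`, `attention_with_kv_downsampling`) are hypothetical, the downsampling operator is assumed to be simple strided subsampling on the 2D token grid, and a single attention head on random data stands in for a real Stable Diffusion layer.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def downsample_tokens(tokens, height, width, factor):
    # Hypothetical helper: tokens has shape (height * width, dim).
    # Assumption: keep every `factor`-th token along each spatial axis.
    n, dim = tokens.shape
    assert n == height * width, "token count must match the spatial grid"
    grid = tokens.reshape(height, width, dim)
    return grid[::factor, ::factor].reshape(-1, dim)


def attention_with_kv_downsampling(q, k, v, height, width, factor=2):
    # Queries keep full resolution; only keys and values are downsampled.
    k_small = downsample_tokens(k, height, width, factor)
    v_small = downsample_tokens(v, height, width, factor)
    # Score matrix is (N, N / factor**2) instead of (N, N).
    scores = q @ k_small.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v_small


rng = np.random.default_rng(0)
height, width, dim = 8, 8, 16
q = rng.standard_normal((height * width, dim))
k = rng.standard_normal((height * width, dim))
v = rng.standard_normal((height * width, dim))

out = attention_with_kv_downsampling(q, k, v, height, width, factor=2)
print(out.shape)  # (64, 16): one output token per query
```

With `factor=2` on a grid whose sides are divisible by 2, the number of key and value tokens drops by 4x, so the score matrix shrinks from N x N to N x N/4 while the output keeps one token per query. Setting `factor=1` recovers ordinary dense attention in this sketch.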