Attention mechanisms have been crucial for image diffusion models; however, their quadratic computational complexity limits the image sizes we can process within reasonable time and memory constraints. This paper investigates the importance of dense attention in generative image models, which often contain redundant features and are therefore well suited to sparser attention mechanisms. We propose ToDo, a novel training-free method that downsamples key and value tokens to accelerate Stable Diffusion inference by up to 2x for common sizes and by 4.5x or more for high resolutions such as 2048x2048. We demonstrate that our approach outperforms previous methods in balancing throughput and fidelity.
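To make the idea of key/value token downsampling concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how tokens are downsampled, so the sketch assumes a simple strided subsampling of the token grid, a single attention head, and no learned projections. The function names (`downsample_tokens`, `downsampled_kv_attention`) and the `factor` parameter are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def downsample_tokens(tokens, height, width, factor):
    # tokens: (height * width, dim), laid out row by row on an image grid.
    # Assumption: keep every `factor`-th token along each spatial axis.
    dim = tokens.shape[-1]
    grid = tokens.reshape(height, width, dim)
    return grid[::factor, ::factor].reshape(-1, dim)


def attention(q, k, v):
    # Standard scaled dot-product attention for a single head.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def downsampled_kv_attention(q, k, v, height, width, factor=2):
    # Queries stay at full resolution; only keys and values are reduced,
    # so the output still has one row per original token.
    k_small = downsample_tokens(k, height, width, factor)
    v_small = downsample_tokens(v, height, width, factor)
    return attention(q, k_small, v_small)


rng = np.random.default_rng(0)
height, width, dim = 8, 8, 16
x = rng.standard_normal((height * width, dim))

dense = attention(x, x, x)
sparse = downsampled_kv_attention(x, x, x, height, width, factor=2)
print(dense.shape, sparse.shape)
```

With N query tokens and a downsampling factor f per spatial axis, the score matrix in this sketch shrinks from N x N to N x (N / f^2) entries, which is where the savings in time and memory would come from. The output keeps its full N x dim shape because the queries are untouched.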