Diffusion Transformers (DiT) have become a leading architecture in image generation. However, the quadratic complexity of attention mechanisms, which are responsible for modeling token-wise relationships, results in significant latency when generating high-resolution images. To address this issue, in this paper we develop a linear attention mechanism that reduces the complexity of pre-trained DiTs to linear. We begin our exploration with a comprehensive summary of existing efficient attention mechanisms and identify four key factors crucial for successful linearization of pre-trained DiTs: locality, formulation consistency, high-rank attention maps, and feature integrity. Based on these insights, we introduce a convolution-like local attention strategy termed CLEAR, which limits feature interactions to a local window around each query token and thus achieves linear complexity. Our experiments indicate that, by fine-tuning the attention layers on merely 10K self-generated samples for 10K iterations, we can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity, yielding results comparable to the teacher model. Simultaneously, it reduces attention computations by 99.5% and accelerates generation by 6.3× when producing 8K-resolution images. Furthermore, we investigate favorable properties of the distilled attention layers, such as zero-shot generalization across various models and plugins, and improved support for multi-GPU parallel inference. Models and code are available at: https://github.com/Huage001/CLEAR.
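The core idea behind CLEAR's linear complexity can be illustrated with a minimal sketch: each query token attends only to key tokens whose 2-D grid coordinates fall within a local window, so the per-query cost is bounded by the window size rather than the total token count. The function and variable names below are illustrative assumptions, not the paper's actual implementation, and a practical version would use a sparse/fused kernel instead of masking a dense score matrix.

```python
import numpy as np

def local_window_attention(q, k, v, coords, radius):
    """Convolution-like local attention sketch: each query attends only
    to keys within Euclidean distance `radius` on the 2-D token grid.
    (Illustrative only; CLEAR's exact window definition and kernel
    implementation may differ.)"""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                       # (N, N) attention logits
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    scores = np.where(dist <= radius, scores, -np.inf)  # mask non-local pairs
    scores -= scores.max(axis=-1, keepdims=True)        # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v, w

# Toy example: a 4x4 token grid with feature dimension 8.
rng = np.random.default_rng(0)
N, d = 16, 8
ys, xs = np.meshgrid(np.arange(4), np.arange(4), indexing="ij")
coords = np.stack([ys.ravel(), xs.ravel()], axis=-1).astype(float)
q, k, v = (rng.standard_normal((N, d)) for _ in range(3))
out, w = local_window_attention(q, k, v, coords, radius=1.5)
```

With a fixed `radius`, each row of `w` has a bounded number of nonzero entries, so the overall cost grows linearly with the number of tokens; tokens outside the window (e.g. opposite corners of the grid) receive exactly zero attention weight.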