Recent Transformer-based diffusion models have shown remarkable performance, largely attributed to the ability of the self-attention mechanism to accurately capture both global and local contexts by computing all-pair interactions among input tokens. However, their quadratic complexity poses significant computational challenges for long-sequence inputs. Conversely, a recent state space model called Mamba offers linear complexity by compressing a filtered global context into a hidden state. Despite its efficiency, this compression inevitably loses fine-grained local dependencies among tokens, which are crucial for effective visual generative modeling. Motivated by these observations, we introduce Local Attentional Mamba (LaMamba) blocks that combine the strengths of self-attention and Mamba, capturing both global contexts and local details with linear complexity. Leveraging the efficient U-Net architecture, our model exhibits exceptional scalability and surpasses the performance of DiT across various model scales on ImageNet at 256x256 resolution, all while utilizing substantially fewer GFLOPs and a comparable number of parameters. Against state-of-the-art diffusion models on ImageNet 256x256 and 512x512, our largest model presents notable advantages, such as a reduction of up to 62% in GFLOPs relative to DiT-XL/2, while achieving superior performance with comparable or fewer parameters. Our code is available at https://github.com/yunxiangfu2001/LaMamba-Diff.