Recent Transformer-based diffusion models have shown remarkable performance, largely attributed to the ability of the self-attention mechanism to accurately capture both global and local contexts by computing all-pair interactions among input tokens. However, their quadratic complexity poses significant computational challenges for long-sequence inputs. Conversely, a recent state space model called Mamba offers linear complexity by compressing a filtered global context into a hidden state. Despite its efficiency, this compression inevitably loses fine-grained local dependencies among tokens, which are crucial for effective visual generative modeling. Motivated by these observations, we introduce Local Attentional Mamba (LaMamba) blocks that combine the strengths of self-attention and Mamba, capturing both global contexts and local details with linear complexity. Leveraging the efficient U-Net architecture, our model exhibits exceptional scalability and surpasses the performance of DiT across various model scales on ImageNet at 256x256 resolution, all while utilizing substantially fewer GFLOPs and a comparable number of parameters. Against state-of-the-art diffusion models on ImageNet 256x256 and 512x512, our largest model presents notable advantages, such as a reduction of up to 62% in GFLOPs relative to DiT-XL/2, while achieving superior performance with comparable or fewer parameters. Our code is available at https://github.com/yunxiangfu2001/LaMamba-Diff.