Diffusion models have long been plagued by scalability and quadratic-complexity issues, especially within transformer-based structures. In this study, we leverage the long-sequence modeling capability of the State-Space Model Mamba to extend its applicability to visual data generation. First, we identify a critical oversight in most current Mamba-based vision methods, namely the lack of consideration for spatial continuity in Mamba's scan scheme. Second, building on this insight, we introduce a simple, plug-and-play, zero-parameter method named Zigzag Mamba, which outperforms Mamba-based baselines and demonstrates improved speed and memory utilization compared to transformer-based baselines. Finally, we integrate Zigzag Mamba with the Stochastic Interpolant framework to investigate the scalability of the model on large-resolution visual datasets such as FacesHQ $1024\times 1024$, UCF101, MultiModal-CelebA-HQ, and MS COCO $256\times 256$. Code will be released at https://taohu.me/zigma/
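To make the spatial-continuity point concrete, the sketch below (an illustrative example, not the authors' released implementation; the function names `zigzag_order`, `raster_order`, and `max_jump` are ours) flattens an $H\times W$ token grid with a zigzag (boustrophedon) scan: every other row is reversed, so each consecutive pair of tokens in the 1D sequence is spatially adjacent, whereas a raster scan jumps across the full image width at every row boundary.

```python
# Minimal sketch of a continuity-preserving zigzag scan over an H x W
# token grid, contrasted with a raster scan. Illustrative only.
import numpy as np

def zigzag_order(h: int, w: int) -> np.ndarray:
    """Flat token indices visiting the grid row by row,
    reversing direction on every other row."""
    idx = np.arange(h * w).reshape(h, w)
    idx[1::2] = idx[1::2, ::-1].copy()  # flip every odd row
    return idx.reshape(-1)

def raster_order(h: int, w: int) -> np.ndarray:
    """Plain row-major (raster) flattening."""
    return np.arange(h * w)

def max_jump(order: np.ndarray, w: int) -> int:
    """Largest Chebyshev distance between spatially consecutive tokens
    in the 1D sequence; 1 means the scan is spatially continuous."""
    ys, xs = np.divmod(order, w)
    return int(np.max(np.maximum(np.abs(np.diff(ys)), np.abs(np.diff(xs)))))

if __name__ == "__main__":
    h, w = 4, 4
    print("zigzag max jump:", max_jump(zigzag_order(h, w), w))  # -> 1
    print("raster max jump:", max_jump(raster_order(h, w), w))  # -> w - 1
```

Under this toy metric, the zigzag scan keeps every sequence-adjacent token pair at spatial distance 1, while the raster scan incurs jumps of width $w-1$, which is the discontinuity the abstract argues current Mamba-based vision methods overlook.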