We introduce AiM, an autoregressive (AR) image generation model built on the Mamba architecture. AiM employs Mamba, a novel state-space model noted for its strong long-sequence modeling performance at linear time complexity, in place of the Transformers commonly used in AR image generation models, aiming to achieve both superior generation quality and faster inference. Unlike existing methods that adapt Mamba to two-dimensional signals via multi-directional scans, AiM directly applies the next-token prediction paradigm to autoregressive image generation, avoiding the extensive modifications otherwise needed for Mamba to learn 2D spatial representations. Through simple but strategically targeted modifications for visual generation tasks, we preserve Mamba's core structure and fully exploit its efficient long-sequence modeling capability and scalability. We provide AiM models at multiple scales, with parameter counts ranging from 148M to 1.3B. On the ImageNet1K 256×256 benchmark, our best AiM model achieves an FID of 2.21, surpassing all existing AR models of comparable parameter count and competing strongly with diffusion models while offering 2× to 10× faster inference. Code is available at https://github.com/hp-l33/AiM
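To illustrate the paradigm the abstract describes, here is a minimal, hypothetical sketch of raster-order next-token prediction over discrete image tokens: a 2D grid of tokens (e.g., from a VQ tokenizer) is flattened into a 1D sequence and generated one token at a time, with no multi-directional scanning. The names `toy_model`, `VOCAB`, and the grid size are illustrative stand-ins, not AiM's actual components; the toy logits stand in for a Mamba backbone.

```python
import numpy as np

VOCAB = 16          # size of the discrete token codebook (illustrative)
H = W = 4           # token grid; the flattened sequence length is H * W

def toy_model(prefix):
    """Stand-in for the sequence backbone: returns logits over the codebook.

    Deterministic toy logits keyed on prefix length, purely for illustration.
    """
    rng = np.random.default_rng(len(prefix))
    return rng.normal(size=VOCAB)

def generate(class_token=0):
    """Generate an H x W token grid by raster-order next-token prediction."""
    seq = [class_token]                  # condition on a class token
    for _ in range(H * W):
        logits = toy_model(seq)
        probs = np.exp(logits - logits.max())   # softmax over the codebook
        probs /= probs.sum()
        seq.append(int(np.argmax(probs)))       # greedy decode for the sketch
    return np.array(seq[1:]).reshape(H, W)      # fold back into the 2D grid

grid = generate()
print(grid.shape)  # (4, 4)
```

In a real system the generated token grid would then be decoded back to pixels by the tokenizer's decoder; the point of the sketch is only that the sequence model itself needs no 2D-specific scan order.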