An Empirical Study of Mamba-based Language Models

Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, Garvit Kulshreshtha, Vartika Singh, Jared Casper, Jan Kautz, Mohammad Shoeybi, Bryan Catanzaro

Selective state-space models (SSMs) like Mamba overcome some of the shortcomings of Transformers, such as quadratic computational complexity with sequence length and large inference-time memory requirements from the key-value cache. Moreover, recent studies have shown that SSMs can match or exceed the language modeling capabilities of Transformers, making them an attractive alternative. In a controlled setting (e.g., same data), however, studies so far have only presented small-scale experiments comparing SSMs to Transformers. To understand the strengths and weaknesses of these architectures at larger scales, we present a direct comparison between 8B-parameter Mamba, Mamba-2, and Transformer models trained on the same datasets of up to 3.5T tokens. We also compare these models to a hybrid architecture consisting of 43% Mamba-2, 7% attention, and 50% MLP layers (Mamba-2-Hybrid). Using a diverse set of tasks, we answer the question of whether Mamba models can match Transformers at larger training budgets. Our results show that while pure SSMs match or exceed Transformers on many tasks, they lag behind Transformers on tasks that require strong copying or in-context learning abilities (e.g., 5-shot MMLU, Phonebook) or long-context reasoning. In contrast, we find that the 8B Mamba-2-Hybrid exceeds the 8B Transformer on all 12 standard tasks we evaluated (+2.65 points on average) and is predicted to be up to 8x faster when generating tokens at inference time. To validate long-context capabilities, we provide additional experiments evaluating variants of the Mamba-2-Hybrid and Transformer extended to support 16K, 32K, and 128K sequences. On an additional 23 long-context tasks, the hybrid model continues to closely match or exceed the Transformer on average. To enable further study, we release the checkpoints as well as the code used to train our models as part of NVIDIA's Megatron-LM project.
