State-space models (SSMs) have recently demonstrated performance competitive with transformers on large-scale language modeling benchmarks while achieving linear time and memory complexity as a function of sequence length. Mamba, a recently released SSM, shows impressive performance on both language modeling and long-sequence processing tasks. Simultaneously, mixture-of-experts (MoE) models have shown remarkable performance while significantly reducing the compute and latency costs of inference at the expense of a larger memory footprint. In this paper, we present BlackMamba, a novel architecture that combines the Mamba SSM with MoE to obtain the benefits of both. We demonstrate that BlackMamba performs competitively against both Mamba and transformer baselines, and outperforms them in inference and training FLOPs. We fully train and open-source 340M/1.5B and 630M/2.8B BlackMamba models on 300B tokens of a custom dataset. We show that BlackMamba inherits and combines the benefits of both SSM and MoE architectures, pairing linear-complexity generation from SSMs with cheap and fast inference from MoE. We release all weights, checkpoints, and inference code as open source. Inference code at: https://github.com/Zyphra/BlackMamba
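To make the two ingredients concrete, the following is a minimal toy sketch, not the BlackMamba implementation. It assumes a scalar, time-invariant linear recurrence in place of Mamba's selective (input-dependent) SSM, and a top-1 router over hypothetical scalar "experts". The names `ssm_scan`, `top1_moe`, and `toy_block` are illustrative only. The intent is to show why a recurrent scan costs one step per token with a fixed-size state, and why routing runs only one expert per token even though all experts must be kept in memory.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Toy scalar linear recurrence (assumed stand-in for an SSM layer):
    h_t = a * h_{t-1} + b * x_t,  y_t = c * h_t.
    A single pass with a fixed-size state: O(L) time, O(1) state."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def top1_moe(x, router_weights, experts):
    """Send the input to the single expert with the highest router score.
    Only that expert is evaluated, but every expert stays resident in memory."""
    scores = [w * x for w in router_weights]
    best = max(range(len(experts)), key=lambda i: scores[i])
    return experts[best](x), best


def toy_block(xs, router_weights, experts, a=0.9):
    """Sequence mixing by the recurrence, then per-token expert routing."""
    return [top1_moe(y, router_weights, experts)[0] for y in ssm_scan(xs, a=a)]


# Hypothetical experts: two cheap scalar functions.
experts = [lambda v: v + 1.0, lambda v: v * 10.0]
outputs = toy_block([1.0, 1.0, 1.0], router_weights=[1.0, -1.0], experts=experts, a=0.5)
```

In this sketch the per-token compute is one recurrence update plus one expert call, regardless of how many experts exist, while memory grows with the number of experts. That is the trade-off the abstract describes.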