Bi-Mamba: Towards Accurate 1-Bit State Space Models

The typical Selective State-Space Model (SSM) used in Mamba addresses several limitations of Transformers, such as the quadratic computational complexity with respect to sequence length and the significant memory requirements during inference due to the key-value (KV) cache. However, the increasing size of Mamba models continues to pose challenges for training and deployment, particularly due to their substantial computational demands during both training and inference. In this work, we introduce $\texttt{Bi-Mamba}$, a scalable and powerful 1-bit Mamba architecture designed to enable more efficient large language models (LLMs), with model sizes of 780M, 1.3B, and 2.7B parameters. $\texttt{Bi-Mamba}$ models are trained from scratch on a standard LLM-scale dataset using an autoregressive distillation loss. Extensive experiments on language modeling benchmarks demonstrate that $\texttt{Bi-Mamba}$ achieves performance comparable to its full-precision (FP16 or BF16) counterparts, while outperforming post-training binarization (PTB) Mamba and binarization-aware training (BAT) Transformer baselines. Moreover, $\texttt{Bi-Mamba}$ drastically reduces memory usage and computational cost compared to the original Mamba. Our work pioneers a new line of linear-complexity LLMs under low-bit representation and paves the way for the design of specialized hardware optimized for efficient 1-bit Mamba-based models. Code and pre-trained weights are available at https://github.com/Tangshengku/Bi-Mamba.
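The abstract does not spell out how weights are binarized or what form the distillation loss takes, so the following is only an illustrative sketch under stated assumptions. It assumes a sign-based binarization with one mean-absolute-value scale per weight row (a common scheme in the binarization literature, not confirmed as the one used here) and a distillation loss defined as the per-token cross-entropy between a full-precision teacher's next-token distribution and the 1-bit student's. The function names `binarize_row`, `distillation_loss`, and `weight_memory_gb` are hypothetical. The memory helper assumes every parameter is stored at the given bit width, so it gives an upper bound on the savings rather than a measured figure.

```python
import math


def binarize_row(weights):
    """Map a row of real weights to {-1, +1} plus one scale factor.

    Assumption: the scale is the mean absolute value of the row, which
    minimizes the squared error between the row and scale * sign(row).
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return signs, scale


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits):
    """Mean per-token cross-entropy of the student against the teacher.

    Both arguments are lists of per-position logit vectors over the
    vocabulary. Assumption: the teacher's soft distribution is the target.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row)  # teacher next-token distribution
        q = softmax(s_row)  # student next-token distribution
        total -= sum(pi * math.log(qi) for pi, qi in zip(p, q))
    return total / len(teacher_logits)


def weight_memory_gb(num_params, bits_per_weight):
    """Weight storage in gigabytes if all parameters use the same bit width."""
    return num_params * bits_per_weight / 8 / 1e9
```

Under these assumptions, `weight_memory_gb(2.7e9, 16)` and `weight_memory_gb(2.7e9, 1)` compare the largest reported model size at 16-bit and 1-bit storage. In practice some components typically stay in higher precision, so the real ratio is smaller than the idealized 16x.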

