Jamba: A Hybrid Transformer-Mamba Language Model

Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avashalom Manevich, Nir Ratner, Noam Rozen, Erez Shwartz, Mor Zusman, Yoav Shoham

Source: arXiv. Webpage: https://www.ai21.com/jamba

We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. This flexible architecture allows resource- and objective-specific configurations. In the particular configuration we have implemented, we end up with a powerful model that fits in a single 80GB GPU. Built at large scale, Jamba provides high throughput and small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for up to 256K tokens context length. We study various architectural decisions, such as how to combine Transformer and Mamba layers, and how to mix experts, and show that some of them are crucial in large scale modeling. We also describe several interesting properties of these architectures which the training and evaluation of Jamba have revealed, and plan to release checkpoints from various ablation runs, to encourage further exploration of this novel architecture. We make the weights of our implementation of Jamba publicly available under a permissive license.
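The interleaving described in the abstract can be pictured as a per-layer schedule that chooses a sequence mixer (attention or Mamba) and a feed-forward type (dense MLP or MoE). The following is a minimal sketch of that idea only, not the authors' implementation: the function name `build_layer_schedule`, the parameters `attn_every` and `moe_every`, and the values used in the example are illustrative assumptions, since the abstract does not state the actual ratios.

```python
from typing import List, Tuple


def build_layer_schedule(
    num_layers: int, attn_every: int, moe_every: int
) -> List[Tuple[str, str]]:
    """Return one (mixer, feed_forward) pair per layer.

    Hypothetical helper: every `attn_every`-th layer uses attention and
    the rest use Mamba; every `moe_every`-th layer uses an MoE
    feed-forward and the rest use a dense MLP.
    """
    if num_layers <= 0 or attn_every <= 0 or moe_every <= 0:
        raise ValueError("all arguments must be positive")

    schedule = []
    for i in range(num_layers):
        # Place the attention layer at the end of each group of `attn_every` layers.
        mixer = "attention" if i % attn_every == attn_every - 1 else "mamba"
        # Place the MoE feed-forward at the end of each group of `moe_every` layers.
        ffn = "moe" if i % moe_every == moe_every - 1 else "mlp"
        schedule.append((mixer, ffn))
    return schedule


# Illustrative values only; the abstract does not give the real configuration.
schedule = build_layer_schedule(num_layers=8, attn_every=4, moe_every=2)
for index, (mixer, ffn) in enumerate(schedule):
    print(index, mixer, ffn)
```

Raising `attn_every` shifts the schedule toward Mamba layers, and raising `moe_every` reduces how many layers carry experts; these are the kinds of knobs the abstract refers to as resource- and objective-specific configurations.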

