We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. This flexible architecture allows resource- and objective-specific configurations. In the particular configuration we have implemented, we end up with a powerful model that fits on a single 80GB GPU. Built at large scale, Jamba provides high throughput and a small memory footprint compared to vanilla Transformers, while achieving state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for context lengths of up to 256K tokens. We study various architectural decisions, such as how to combine Transformer and Mamba layers and how to mix experts, and show that some of them are crucial in large-scale modeling. We also describe several interesting properties of these architectures that the training and evaluation of Jamba have revealed, and we plan to release checkpoints from various ablation runs to encourage further exploration of this novel architecture. We make the weights of our implementation of Jamba publicly available under a permissive license.
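The interleaving scheme described above can be sketched as a layer schedule: mostly Mamba layers with periodic attention (Transformer) layers, and an MoE module replacing the dense MLP in a subset of layers. The minimal Python sketch below illustrates the idea only; the period values and names (`attn_period`, `moe_period`) are illustrative assumptions, not the released model's actual configuration.

```python
def jamba_layer_schedule(n_layers=8, attn_period=8, moe_period=2):
    """Build a hypothetical interleaved layer schedule.

    Each layer is a (mixer, mlp) pair:
      - mixer: "attention" every `attn_period` layers, else "mamba"
      - mlp:   "moe" every `moe_period` layers, else "dense"
    Periods are illustrative placeholders, not Jamba's published config.
    """
    schedule = []
    for i in range(n_layers):
        mixer = "attention" if i % attn_period == 0 else "mamba"
        mlp = "moe" if i % moe_period == 1 else "dense"
        schedule.append((mixer, mlp))
    return schedule


# With the defaults, one 8-layer block contains a single attention
# layer, seven Mamba layers, and MoE in every other layer.
print(jamba_layer_schedule())
```

Varying the two periods is one way to trade off memory (attention's KV cache grows with context length, while Mamba's state is constant-size) against quality and active-parameter count, which is the kind of resource-specific configuration the abstract refers to.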