We introduce Llamba, a family of efficient recurrent language models distilled from Llama-3.x into the Mamba architecture. The series includes Llamba-1B, Llamba-3B, and Llamba-8B, which achieve higher inference throughput and handle significantly larger batch sizes than Transformer-based models while maintaining comparable benchmark performance. Furthermore, Llamba demonstrates the effectiveness of cross-architecture distillation using MOHAWK (Bick et al., 2024), achieving these results with less than 0.1% of the training data typically used for models of similar size. To take full advantage of their efficiency, we provide an optimized implementation of Llamba for resource-constrained devices such as smartphones and edge platforms, offering a practical and memory-efficient alternative to Transformers. Overall, Llamba improves the tradeoff between speed, memory efficiency, and performance, making high-quality language models more accessible.