Scaling large language models (LLMs) significantly improves performance but comes with prohibitive computational costs. Mixture-of-Experts (MoE) models offer an efficient alternative, increasing capacity without a proportional rise in compute requirements. However, training MoE models from scratch poses challenges like overfitting and routing instability. We present an efficient training recipe leveraging pre-trained dense checkpoints, training an 8-Expert Top-2 MoE model from Llama 3-8B with less than $1\%$ of typical pre-training compute. Our approach enhances downstream performance on academic benchmarks, achieving a $\textbf{2\%}$ improvement in 0-shot accuracy on MMLU, while reaching a Model FLOPs Utilization (MFU) of $\textbf{46.8\%}$ during training using our framework. We also integrate online upcycling in NeMo for seamless use of pre-trained weights, enabling cost-effective development of high-capacity MoE models.
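The core idea behind upcycling is to initialize every expert as a copy of the dense model's feed-forward weights, so the MoE starts from the dense model's behavior and a freshly initialized router learns to specialize experts during continued training. A minimal NumPy sketch of this initialization (the function name and shapes are illustrative assumptions, not the NeMo API):

```python
import numpy as np

def upcycle_dense_ffn(w_in, w_out, num_experts=8, seed=0):
    """Upcycle one dense FFN layer into an MoE layer.

    Each expert is an exact copy of the dense FFN's (w_in, w_out) pair,
    so the upcycled layer initially computes the same function as the
    dense layer regardless of routing. The router is freshly initialized
    with small random weights. Hypothetical helper for illustration only.
    """
    experts = [(w_in.copy(), w_out.copy()) for _ in range(num_experts)]
    hidden_dim = w_in.shape[0]
    rng = np.random.default_rng(seed)
    # Small-scale init keeps early routing near-uniform under softmax.
    router = rng.normal(scale=0.02, size=(hidden_dim, num_experts))
    return experts, router

# Toy dense FFN: hidden=16, intermediate=64.
w_in = np.random.randn(16, 64)
w_out = np.random.randn(64, 16)
experts, router = upcycle_dense_ffn(w_in, w_out, num_experts=8)
```

With Top-2 routing, each token is then dispatched to the two experts with the highest router scores, and because all experts start identical, the upcycled model's outputs match the dense model's at step zero.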