In recent years, with the rapid adoption of large language models (LLMs) across various fields, model scale has grown steadily, and the resources required for pre-training have increased exponentially. Training an LLM from scratch consumes enormous computational resources, whereas scaling up from a smaller model is a more efficient approach and has therefore attracted significant attention. In this paper, we present AquilaMoE, a cutting-edge bilingual 8*16B Mixture of Experts (MoE) language model with 8 experts of 16 billion parameters each, developed using an innovative training methodology called EfficientScale. This approach optimizes performance while minimizing data requirements through a two-stage process. The first stage, termed Scale-Up, initializes the larger model with weights from a pre-trained smaller model, enabling substantial knowledge transfer and continuous pretraining with significantly less data. The second stage, Scale-Out, uses a pre-trained dense model to initialize the MoE experts, further enhancing knowledge transfer and performance. Extensive validation experiments on 1.8B and 7B models compared various initialization schemes, yielding models that preserve transferred knowledge and continue to reduce loss during continuous pretraining. Using the optimal scheme, we successfully trained a 16B model and subsequently the 8*16B AquilaMoE model, demonstrating significant improvements in performance and training efficiency.
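The abstract names the two stages but does not spell out the initialization mechanics. The following is a minimal sketch of the Scale-Out idea only, under the assumption (not stated in the abstract) that each MoE expert is seeded as a copy of the pre-trained dense model's FFN block; `DenseFFN`, `scale_out_init`, and all dimensions here are hypothetical illustration names and toy sizes, not the paper's actual code or configuration.

```python
import torch
import torch.nn as nn

# Hypothetical dense FFN block standing in for one transformer MLP layer.
class DenseFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)))

def scale_out_init(dense_ffn: DenseFFN, num_experts: int) -> nn.ModuleList:
    """Seed each MoE expert from the pre-trained dense FFN's weights
    (the Scale-Out idea: experts start from the dense model rather than
    from random initialization, so knowledge transfers directly)."""
    experts = nn.ModuleList()
    for _ in range(num_experts):
        expert = DenseFFN(dense_ffn.up.in_features, dense_ffn.up.out_features)
        expert.load_state_dict(dense_ffn.state_dict())  # copy dense weights
        experts.append(expert)
    return experts

# Usage: build 8 experts from one pre-trained dense FFN block (toy sizes).
dense = DenseFFN(d_model=512, d_ff=2048)
experts = scale_out_init(dense, num_experts=8)
```

With identical experts at initialization, the router's subsequent training is what differentiates them; the Scale-Up stage's weight-expansion schemes are compared empirically in the paper and are not assumed here.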