In recent years, with the rapid adoption of large language models (LLMs) across various fields, model scale has grown steadily, and the resources required for pre-training have increased exponentially. Training an LLM from scratch consumes enormous computational resources, whereas scaling up from a smaller model is a more efficient approach and has therefore attracted significant attention. In this paper, we present AquilaMoE, a cutting-edge bilingual 8*16B Mixture of Experts (MoE) language model with 8 experts of 16 billion parameters each, developed using an innovative training methodology called EfficientScale. This approach optimizes performance while minimizing data requirements through a two-stage process. The first stage, termed Scale-Up, initializes the larger model with weights from a pre-trained smaller model, enabling substantial knowledge transfer and continuous pretraining with significantly less data. The second stage, Scale-Out, uses a pre-trained dense model to initialize the MoE experts, further enhancing knowledge transfer and performance. Extensive validation experiments on 1.8B and 7B models compared various initialization schemes, yielding models that preserve transferred knowledge and continue to reduce loss during continuous pretraining. Using the optimal scheme, we successfully trained a 16B model and subsequently the 8*16B AquilaMoE model, demonstrating significant improvements in performance and training efficiency.
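The abstract names the two stages but does not spell out the initialization mechanics. The following is a minimal sketch of the Scale-Out idea only, under the assumption (not stated in the abstract) that each MoE expert is seeded as a copy of the pre-trained dense model's FFN block; `DenseFFN`, `scale_out_init`, and all dimensions here are hypothetical illustration names and toy sizes, not the paper's actual code or configuration.

```python
import torch
import torch.nn as nn

# Hypothetical dense FFN block standing in for one transformer MLP layer.
class DenseFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)))

def scale_out_init(dense_ffn: DenseFFN, num_experts: int) -> nn.ModuleList:
    """Seed each MoE expert from the pre-trained dense FFN's weights
    (the Scale-Out idea: experts start from the dense model rather than
    from random initialization, so knowledge transfers directly)."""
    experts = nn.ModuleList()
    for _ in range(num_experts):
        expert = DenseFFN(dense_ffn.up.in_features, dense_ffn.up.out_features)
        expert.load_state_dict(dense_ffn.state_dict())  # copy dense weights
        experts.append(expert)
    return experts

# Usage: build 8 experts from one pre-trained dense FFN block (toy sizes).
dense = DenseFFN(d_model=512, d_ff=2048)
experts = scale_out_init(dense, num_experts=8)
```

With identical experts at initialization, the router's subsequent training is what differentiates them; the Scale-Up stage's weight-expansion schemes are compared empirically in the paper and are not assumed here.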