LLMs are computationally expensive to pre-train due to their large scale. Model growth has emerged as a promising approach that leverages smaller models to accelerate the training of larger ones. However, the viability of these model growth methods in efficient LLM pre-training remains underexplored. This work identifies three critical $\underline{\textit{O}}$bstacles: ($\textit{O}$1) lack of comprehensive evaluation, ($\textit{O}$2) untested viability for scaling, and ($\textit{O}$3) lack of empirical guidelines. To tackle $\textit{O}$1, we summarize existing approaches into four atomic growth operators and systematically evaluate them in a standardized LLM pre-training setting. Our findings reveal that a depthwise stacking operator, called $G_{\text{stack}}$, exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance on eight standard NLP benchmarks compared to strong baselines. Motivated by these promising results, we conduct extensive experiments to delve deeper into $G_{\text{stack}}$ to address $\textit{O}$2 and $\textit{O}$3. For $\textit{O}$2 (untested scalability), our study shows that $G_{\text{stack}}$ is scalable and consistently performs well, with experiments on LLMs up to 7B parameters after growth and pre-training with up to 750B tokens. For example, compared to a conventionally trained 7B model using 300B tokens, our $G_{\text{stack}}$ model converges to the same loss with only 194B tokens, a 54.6\% speedup. We further address $\textit{O}$3 (lack of empirical guidelines) by formalizing guidelines to determine the growth timing and growth factor for $G_{\text{stack}}$, making it practical for general LLM pre-training. We also provide in-depth discussions and comprehensive ablation studies of $G_{\text{stack}}$. Our code and pre-trained model are available at https://llm-stacking.github.io.
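To make the depthwise stacking idea concrete, below is a minimal sketch of a growth operator in the spirit of $G_{\text{stack}}$: a small model, represented here simply as an ordered list of per-layer parameter objects, is grown in depth by stacking copies of its layers. The function name `g_stack`, the growth factor `g`, and the dict-based layer representation are all illustrative assumptions, not the paper's actual implementation.

```python
import copy

def g_stack(layers, g):
    """Grow a model depthwise by stacking g copies of its layers.

    layers: ordered list of per-layer parameter objects (the small base model).
    g: integer growth factor; the grown model has g * len(layers) layers.
    """
    grown = []
    for _ in range(g):
        # Deep-copy so each stacked copy can be trained independently
        # during continued pre-training of the grown model.
        grown.extend(copy.deepcopy(layers))
    return grown

# Example: a 6-layer base model grown with factor g = 4 yields a
# 24-layer model whose weights are initialized from the small one.
base = [{"layer_id": i} for i in range(6)]
grown = g_stack(base, 4)
print(len(grown))  # → 24
```

The grown model inherits its initialization from the trained small model rather than starting from random weights, which is the source of the training acceleration the abstract reports.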