LLMs are computationally expensive to pre-train due to their large scale. Model growth has emerged as a promising approach that leverages smaller models to accelerate the training of larger ones. However, the viability of these model growth methods in efficient LLM pre-training remains underexplored. This work identifies three critical $\underline{\textit{O}}$bstacles: ($\textit{O}$1) lack of comprehensive evaluation, ($\textit{O}$2) untested viability for scaling, and ($\textit{O}$3) lack of empirical guidelines. To tackle $\textit{O}$1, we summarize existing approaches into four atomic growth operators and systematically evaluate them in a standardized LLM pre-training setting. Our findings reveal that a depthwise stacking operator, called $G_{\text{stack}}$, exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance on eight standard NLP benchmarks compared to strong baselines. Motivated by these promising results, we conduct extensive experiments to delve deeper into $G_{\text{stack}}$ to address $\textit{O}$2 and $\textit{O}$3. For $\textit{O}$2 (untested scalability), our study shows that $G_{\text{stack}}$ is scalable and consistently performs well, with experiments on LLMs up to 7B parameters after growth and pre-training with up to 750B tokens. For example, compared to a conventionally trained 7B model using 300B tokens, our $G_{\text{stack}}$ model converges to the same loss with only 194B tokens, a 54.6\% speedup. We further address $\textit{O}$3 (lack of empirical guidelines) by formalizing guidelines to determine the growth timing and growth factor for $G_{\text{stack}}$, making it practical for general LLM pre-training. We also provide in-depth discussions and comprehensive ablation studies of $G_{\text{stack}}$. Our code and pre-trained model are available at https://llm-stacking.github.io.
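To make the depthwise stacking idea concrete, below is a minimal sketch of a growth operator in the spirit of $G_{\text{stack}}$: a small model, represented here simply as an ordered list of per-layer parameter objects, is grown in depth by stacking copies of its layers. The function name `g_stack`, the growth factor `g`, and the dict-based layer representation are all illustrative assumptions, not the paper's actual implementation.

```python
import copy

def g_stack(layers, g):
    """Grow a model depthwise by stacking g copies of its layers.

    layers: ordered list of per-layer parameter objects (the small base model).
    g: integer growth factor; the grown model has g * len(layers) layers.
    """
    grown = []
    for _ in range(g):
        # Deep-copy so each stacked copy can be trained independently
        # during continued pre-training of the grown model.
        grown.extend(copy.deepcopy(layers))
    return grown

# Example: a 6-layer base model grown with factor g = 4 yields a
# 24-layer model whose weights are initialized from the small one.
base = [{"layer_id": i} for i in range(6)]
grown = g_stack(base, 4)
print(len(grown))  # → 24
```

The grown model inherits its initialization from the trained small model rather than starting from random weights, which is the source of the training acceleration the abstract reports.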