Accelerating the pre-training of large language models is a critical problem in current research. In this paper, we focus on speeding up pre-training by progressively growing a small Transformer structure into a large one. There are two main research problems associated with progressive growth: determining the optimal growth schedule, and designing efficient growth operators. Regarding growth schedules, the impact of each individual dimension on a schedule's efficiency is under-explored by existing work. Regarding growth operators, existing methods rely on the initialization of new weights to inherit knowledge and achieve only non-strict function preservation, which limits further improvements in training dynamics. To address these issues, we propose Masked Structural Growth (MSG), including (i) growth schedules involving all possible dimensions and (ii) strictly function-preserving growth operators that are independent of the initialization of new weights. Experiments show that MSG is significantly faster than related work: we achieve up to 2.2x speedup in pre-training different types of language models while maintaining comparable or better downstream performance. Code is publicly available at https://github.com/cofe-ai/MSG.
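To make the mask idea concrete, below is a minimal PyTorch sketch of a mask-based, strictly function-preserving growth operator on a single linear layer. This is illustrative only, not the paper's implementation: the names `MaskedGrowingLinear`, `grow_output`, and `anneal_mask` are our own. Newly added output neurons are gated by a zero mask entry, so the layer computes exactly the same function at the moment of growth regardless of how the new weights are initialized; the mask is then ramped toward 1 during continued training.

```python
import torch
import torch.nn as nn

class MaskedGrowingLinear(nn.Module):
    """Hypothetical sketch of a mask-based growth operator.

    New output neurons are gated by a mask initialized to 0, so the
    layer's function is strictly preserved at the moment of growth,
    independent of how the new weights are initialized. The mask is
    then annealed toward 1 during continued training.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Mask entry 1.0 = fully active neuron; 0.0 = freshly grown.
        self.register_buffer("mask", torch.ones(out_features))

    def grow_output(self, new_out_features: int):
        old = self.linear
        grown = nn.Linear(old.in_features, new_out_features)
        with torch.no_grad():
            # Copy the old weights; the new rows keep their random
            # init, which is irrelevant while they are masked out.
            grown.weight[: old.out_features] = old.weight
            grown.bias[: old.out_features] = old.bias
        self.linear = grown
        # New neurons start fully masked, so the output is unchanged.
        new_mask = torch.zeros(new_out_features)
        new_mask[: self.mask.numel()] = self.mask
        self.mask = new_mask

    def anneal_mask(self, ramp: float):
        # Call with a ramp value increasing from 0 to 1 over training
        # steps; masked (new) entries move toward full activity.
        with torch.no_grad():
            self.mask[self.mask < 1.0] = ramp

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) * self.mask


# Strict function preservation check: outputs of the original neurons
# match before and after growth, and the new neurons emit exactly 0.
layer = MaskedGrowingLinear(4, 8)
x = torch.randn(2, 4)
y_before = layer(x)
layer.grow_output(12)
y_after = layer(x)
assert torch.allclose(y_before, y_after[:, :8])
assert torch.all(y_after[:, 8:] == 0)
```

Because the grown neurons are gated to zero, the growth operator places no constraint on how their weights are initialized; annealing the mask then phases the new capacity in smoothly rather than perturbing the loss at the growth step.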