Accelerating large language model pre-training is a critical issue in current NLP research. In this paper, we focus on speeding up pre-training by progressively growing from a small Transformer structure to a large one. There are two main research problems related to progressive growth: the growth schedule and the growth operator. For the growth schedule, existing work has explored multi-stage expansion of depth and feedforward layers. However, the impact of each dimension on the schedule's efficiency is still an open question. For the growth operator, existing work relies on the initialization of new weights to inherit knowledge and achieves only non-strict function preservation, limiting further optimization of training dynamics. To address these issues, we propose Masked Structural Growth (MSG), which includes growth schedules involving all possible dimensions and strictly function-preserving growth operators that are independent of the initialization of new weights. Experiments show that MSG is significantly faster than related work: we achieve a speed-up of 80% for BERT-base and 120% for BERT-large pre-training. Moreover, MSG is able to improve fine-tuning performance at the same time.
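To make "strictly function-preserving, independent of the initialization of new weights" concrete, here is a minimal sketch on a toy two-layer MLP. The abstract does not spell out the mechanism, so this assumes a multiplicative mask that is set to zero on newly added hidden units. The functions `mlp` and `grow_hidden` and all sizes are hypothetical illustrations, not the paper's Transformer implementation.

```python
import random


def mlp(x, w1, b1, w2, b2, mask):
    """Two-layer MLP with ReLU; each hidden unit is scaled by its mask value."""
    hidden = [
        max(sum(w * xi for w, xi in zip(row, x)) + b, 0.0) * m
        for row, b, m in zip(w1, b1, mask)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]


def grow_hidden(w1, b1, w2, mask, new_width, rng):
    """Widen the hidden layer (assumed scheme, not taken from the paper).

    New weights are drawn at random on purpose: the zero mask on the new
    units, not the initialization, is what keeps the output unchanged.
    """
    extra = new_width - len(w1)
    d_in = len(w1[0])
    w1_new = w1 + [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(extra)]
    b1_new = b1 + [rng.gauss(0, 1) for _ in range(extra)]
    w2_new = [row + [rng.gauss(0, 1) for _ in range(extra)] for row in w2]
    mask_new = mask + [0.0] * extra  # new units start fully masked
    return w1_new, b1_new, w2_new, mask_new


rng = random.Random(0)
d_in, d_hidden, d_out = 4, 8, 3  # illustrative sizes only

w1 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hidden)]
b1 = [rng.gauss(0, 1) for _ in range(d_hidden)]
w2 = [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_out)]
b2 = [rng.gauss(0, 1) for _ in range(d_out)]
mask = [1.0] * d_hidden
x = [rng.gauss(0, 1) for _ in range(d_in)]

y_small = mlp(x, w1, b1, w2, b2, mask)
w1_g, b1_g, w2_g, mask_g = grow_hidden(w1, b1, w2, mask, new_width=16, rng=rng)
y_grown = mlp(x, w1_g, b1_g, w2_g, b2, mask_g)

# Largest output difference between the small and the grown network.
max_diff = max(abs(a - b) for a, b in zip(y_small, y_grown))
```

In this sketch the grown network reproduces the small network's output whatever values the new weights take, because every new hidden unit is multiplied by zero before it reaches the output layer. How the mask is relaxed afterwards is outside the scope of this illustration.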