Transformer architectures serve as the backbone of most modern Large Language Models, so their pretraining stability and convergence speed are of central concern. Motivated by the logical dependency among sequentially stacked layers, we propose Progressive Residual Warmup (ProRes) for language model pretraining. ProRes implements an "early layers learn first" philosophy by multiplying each layer's residual branch by a scalar that gradually warms up from 0 to 1, with deeper layers following longer warmup schedules. In this way, deeper layers wait for early layers to settle into a more stable regime before contributing to learning. We demonstrate the effectiveness of ProRes through pretraining experiments across various model scales, normalization schemes, and initialization schemes. Comprehensive analysis shows that ProRes not only stabilizes pretraining but also induces a distinctive optimization trajectory, leading to faster convergence, stronger generalization, and better downstream performance. Our code is available at https://github.com/dandingsky/ProRes.
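The per-layer residual scaling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear warmup shape, the `base_warmup` and `per_layer_extra` constants, and the function name are all assumptions introduced for clarity.

```python
# Hypothetical sketch of a ProRes-style residual warmup schedule.
# Assumption: warmup is linear, and its length grows with layer depth
# so that deeper layers reach full contribution later. The exact
# schedule and constants in the paper may differ.

def prores_scale(step: int, layer_idx: int,
                 base_warmup: int = 1000,
                 per_layer_extra: int = 500) -> float:
    """Residual multiplier in [0, 1] for one layer at one training step."""
    warmup_steps = base_warmup + layer_idx * per_layer_extra
    return min(1.0, step / warmup_steps)

# In a transformer block, the scalar would gate the residual branch:
#   x = x + prores_scale(step, layer_idx) * sublayer(x)
# so early layers contribute fully before deeper layers do.
```

For example, at step 1000 the first layer (`layer_idx = 0`) is fully warmed up (`scale = 1.0`), while a deeper layer such as `layer_idx = 4` is still partially gated, matching the "early layers learn first" behavior described in the abstract.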