Small-scale proxies for large-scale Transformer training instabilities

Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D. Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, Simon Kornblith

Teams that have trained large Transformer-based models have reported training instabilities at large scale that did not appear when training with the same hyperparameters at smaller scales. Although the causes of such instabilities are of scientific interest, the amount of resources required to reproduce them has made investigation difficult. In this work, we seek ways to reproduce and study training stability and instability at smaller scales. First, we focus on two sources of training instability described in previous work: the growth of logits in attention layers (Dehghani et al., 2023) and divergence of the output logits from the log probabilities (Chowdhery et al., 2022). By measuring the relationship between learning rate and loss across scales, we show that these instabilities also appear in small models when training at high learning rates, and that mitigations previously employed at large scales are equally effective in this regime. This prompts us to investigate the extent to which other known optimizer and model interventions influence the sensitivity of the final loss to changes in the learning rate. To this end, we study methods such as warm-up, weight decay, and the $\mu$Param (Yang et al., 2022), and combine techniques to train small models that achieve similar losses across orders of magnitude of learning rate variation. Finally, to conclude our exploration, we study two cases in which instabilities can be predicted before they emerge by examining the scaling behavior of model activation and gradient norms.
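
To make the two instabilities named above concrete, the following is a minimal sketch of quantities one might monitor for each. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the `z_loss` penalty of the form `coeff * (log Z)^2` with a default coefficient of `1e-4` is assumed here as the kind of mitigation associated with Chowdhery et al. (2022), and the abstract itself does not specify any of these details.

```python
import math


def log_partition(logits):
    """Return log Z = log(sum(exp(logits))), computed in a numerically stable way."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def log_probs(logits):
    """Log probabilities are the logits shifted by log Z."""
    log_z = log_partition(logits)
    return [x - log_z for x in logits]


def z_loss(logits, coeff=1e-4):
    """Assumed auxiliary penalty coeff * (log Z)^2, which pulls log Z toward 0.

    When log Z is 0, the output logits coincide with the log probabilities.
    """
    return coeff * log_partition(logits) ** 2


def max_attention_logit(queries, keys):
    """Largest scaled dot product q.k / sqrt(d) over all query/key pairs."""
    d = len(queries[0])
    return max(
        sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
        for q in queries
        for k in keys
    )


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]
    shifted = [x + 5.0 for x in logits]  # same distribution, larger log Z
    print(log_partition(logits), log_partition(shifted))
    print(z_loss(logits), z_loss(shifted))
    print(max_attention_logit([[1.0, 0.0]], [[2.0, 0.0], [0.0, 3.0]]))
```

Adding a constant to every output logit leaves the log probabilities unchanged but moves log Z by that constant, which is why log Z is a natural quantity to track for the second instability. The maximum attention logit plays the analogous role for the first.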
