Training stability of large language models (LLMs) is an important research topic. Reproducing training instabilities can be costly, so we use a small language model with 830M parameters and experiment with higher learning rates to force models to diverge. One source of training instability is the growth of logits in attention layers. We extend the focus of previous work and examine not only the magnitude of the logits but the outputs of all linear layers in the Transformer block. We observe that with a high learning rate the L2 norm of the outputs of all linear layers can grow with each training step until the model diverges. In particular, the QKV, Proj, and FC2 layers show the largest growth in output magnitude. This prompts us to explore several options: 1) apply layer normalization not only after the QK layers but also after the Proj and FC2 layers; 2) apply layer normalization after the QKV layer (and remove pre-normalization); 3) apply QK layer normalization together with softmax capping. We show that with the last two methods we can increase the learning rate by 1.5x (without model divergence) compared to an approach based on QK layer normalization only. We also observe significant perplexity improvements for all three methods over the baseline model.
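Two of the stabilization ingredients named above, QK layer normalization and softmax capping, can be illustrated with a minimal attention module. This is a hedged sketch, not the paper's exact implementation: the module name, the per-head placement of LayerNorm, the tanh form of the cap, and the cap value are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StabilizedAttention(nn.Module):
    """Self-attention with optional QK LayerNorm and softmax logit capping.

    Illustrative sketch only: hyperparameters and exact normalization
    placement are assumptions, not the paper's verified configuration.
    """

    def __init__(self, d_model, n_heads, qk_norm=True, softmax_cap=None):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused QKV projection
        self.proj = nn.Linear(d_model, d_model)       # output projection
        self.softmax_cap = softmax_cap                # e.g. 30.0; None disables capping
        self.qk_norm = qk_norm
        if qk_norm:
            # Normalize queries and keys per head to bound attention logit growth.
            self.q_norm = nn.LayerNorm(self.d_head)
            self.k_norm = nn.LayerNorm(self.d_head)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (B, heads, T, d_head).
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        if self.qk_norm:
            q, k = self.q_norm(q), self.k_norm(k)
        logits = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        if self.softmax_cap is not None:
            # Soft-cap: tanh squashes logits into (-cap, cap), limiting their magnitude.
            logits = self.softmax_cap * torch.tanh(logits / self.softmax_cap)
        out = F.softmax(logits, dim=-1) @ v
        out = out.transpose(1, 2).reshape(B, T, C)
        return self.proj(out)
```

With `qk_norm=True` the attention logits are computed from unit-scale queries and keys, and the optional cap keeps them bounded even if upstream activations grow, which is the failure mode the abstract describes.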