Loss spikes, in which the loss value suddenly diverges, are a fundamental issue in the pre-training of large language models. This paper hypothesizes that non-uniformity in the norms of the parameters is one of the causes of loss spikes. When training neural networks, the scale of the gradients must be kept constant across layers to avoid the vanishing and exploding gradient problems. However, meeting this requirement in the Transformer model forces the norms of the model parameters to be non-uniform, and parameters with smaller norms are therefore more sensitive to parameter updates. To address this issue, we propose a novel technique, weight scaling as reparameterization (WeSaR). WeSaR introduces a gate parameter for each parameter matrix and adjusts it to the value that satisfies the requirement. Because the gate parameter absorbs the required scale, WeSaR can set the norms of the original parameters uniformly, which results in stable training. Experimental results with Transformer decoders consisting of 130 million, 1.3 billion, and 13 billion parameters showed that WeSaR stabilizes and accelerates training and that it outperforms the compared methods, including popular initialization methods.
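The reparameterization described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the common standard deviation of 0.02, and the per-matrix target standard deviations are all hypothetical values chosen for illustration. The sketch assumes each effective weight matrix is the product of a scalar gate and a stored matrix, that every stored matrix is drawn with the same standard deviation, and that the gate carries the scale a conventional initialization would otherwise put into the matrix itself. It covers initialization only.

```python
import math
import random


def init_wesar(shape, target_std, common_std=0.02, rng=None):
    """Return (gate, stored) so that gate * stored has std close to target_std.

    common_std and the idea of a per-matrix target_std are assumptions made
    for this sketch; they are not values taken from the paper.
    """
    if rng is None:
        rng = random.Random(0)
    rows, cols = shape
    # Every stored matrix uses the same std, so their norms are uniform.
    stored = [[rng.gauss(0.0, common_std) for _ in range(cols)]
              for _ in range(rows)]
    # The gate absorbs the scale that the layer actually requires.
    gate = target_std / common_std
    return gate, stored


def effective_weight(gate, stored):
    """Weight used in the forward pass: the gate times the stored matrix."""
    return [[gate * w for w in row] for row in stored]


def std(matrix):
    """Population standard deviation over all entries of a matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


# Hypothetical target scales for three matrices of one Transformer block.
targets = {"attn_out": 0.005, "ffn_in": 0.02, "ffn_out": 0.01}
rng = random.Random(0)
layers = {name: init_wesar((64, 64), t, rng=rng) for name, t in targets.items()}

for name, (gate, stored) in layers.items():
    eff = effective_weight(gate, stored)
    print(f"{name}: gate={gate:.3f} "
          f"stored_std={std(stored):.4f} effective_std={std(eff):.4f}")
```

In this sketch the stored matrices all share roughly the same scale, while the effective weights still differ in scale from matrix to matrix, which is the separation between parameter norm and required scale that the abstract describes.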