Loss spikes often occur during the pre-training of large language models. These spikes degrade model performance and sometimes ruin pre-training altogether. Because pre-training requires a vast computational budget, such spikes should be avoided. To investigate the cause of loss spikes, we focus on the gradients of internal layers. Through theoretical analyses, we identify two causes of exploding gradients and provide requirements to prevent the explosion. In addition, we propose a method that satisfies these requirements by combining an initialization method with a simple modification to embeddings. We conduct various experiments to verify our theoretical analyses empirically. The experimental results indicate that this combination is effective in preventing spikes during pre-training.