Pretraining of large language models is not only expensive but also prone to certain training instabilities. A specific instability that often occurs at the end of training with large learning rates is output logit divergence. The most widely used mitigation strategy, z-loss, merely addresses the symptoms rather than the underlying cause of the problem. In this paper, we analyze the instability from the perspective of the output embeddings' geometry and identify its cause. Based on this, we propose output embedding centering (OEC) as a new mitigation strategy and prove that it suppresses output logit divergence. OEC can be implemented in two different ways: as a deterministic operation called μ-centering, or as a regularization method called μ-loss. Our experiments show that both variants outperform z-loss in terms of training stability and learning rate sensitivity. In particular, they ensure that training converges even for large learning rates where z-loss fails. Furthermore, we find that μ-loss is significantly less sensitive to regularization hyperparameter tuning than z-loss.
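For illustration, the sketch below shows one way output embedding centering could be realized in PyTorch, under the assumption that μ-centering subtracts the vocabulary-wise mean of the output embedding (unembedding) matrix before the logit projection, and that μ-loss instead penalizes the norm of that mean as a regularizer. The class and function names, the coefficient, and the exact placement of the centering are illustrative assumptions, not the paper's definitive formulation.

```python
import torch
import torch.nn as nn


class CenteredOutputHead(nn.Module):
    """Output projection with mu-centering (assumed formulation).

    Centering the unembedding matrix over the vocabulary makes the logits
    zero-mean over the vocabulary for every hidden state h, since
    sum_v (h . W_v) = h . sum_v W_v = 0 after centering.
    """

    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(vocab_size, d_model) * d_model ** -0.5)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # mu = mean output embedding over the vocabulary, shape (1, d_model)
        mu = self.weight.mean(dim=0, keepdim=True)
        centered_weight = self.weight - mu          # (vocab_size, d_model)
        return hidden @ centered_weight.T           # logits, shape (..., vocab_size)


def mu_loss(output_weight: torch.Tensor, coeff: float = 1e-4) -> torch.Tensor:
    """Regularization variant (assumed): penalize the squared norm of the
    mean output embedding instead of centering it deterministically."""
    mu = output_weight.mean(dim=0)
    return coeff * mu.pow(2).sum()
```

As a usage sketch, the μ-loss term would simply be added to the cross-entropy objective during training (e.g. `loss = ce_loss + mu_loss(model.head.weight)`), whereas μ-centering requires no extra loss term because the centering is applied deterministically in the forward pass.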