Large Language Models (LLMs) have achieved remarkable success, yet recent findings reveal that their deeper layers often contribute minimally and can be pruned without affecting overall performance. While some view this as an opportunity for model compression, we identify it as a training shortfall rooted in the widespread use of Pre-Layer Normalization (Pre-LN). We demonstrate that Pre-LN, commonly employed in models like GPT and LLaMA, leads to diminished gradient norms in its deeper layers, reducing their effectiveness. In contrast, Post-Layer Normalization (Post-LN) preserves larger gradient norms in deeper layers but suffers from vanishing gradients in earlier layers. To address this, we introduce Mix-LN, a novel normalization technique that combines the strengths of Pre-LN and Post-LN within the same model. Mix-LN applies Post-LN to the earlier layers and Pre-LN to the deeper layers, ensuring more uniform gradients across layers. This allows all parts of the network, both shallow and deep layers, to contribute effectively to training. Extensive experiments across model sizes from 70M to 7B parameters demonstrate that Mix-LN consistently outperforms both Pre-LN and Post-LN, promoting more balanced, healthier gradient norms throughout the network and enhancing the overall quality of LLM pre-training. Furthermore, we demonstrate that models pre-trained with Mix-LN learn more effectively during supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) than those using Pre-LN or Post-LN, highlighting the critical importance of high-quality deep layers. By effectively addressing the inefficiencies of deep layers in current LLMs, Mix-LN unlocks their potential, enhancing model capacity without increasing model size. Our code is available at https://github.com/pixeli99/MixLN.
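The layer-wise normalization placement described above can be sketched as a toy PyTorch module. The cutoff parameter `post_ln_layers` and the single linear sublayer are illustrative assumptions only; a real Transformer block would contain attention and MLP sublayers, and the actual split between Post-LN and Pre-LN layers follows the paper's configuration, not this sketch.

```python
import torch
import torch.nn as nn


class MixLNBlock(nn.Module):
    """Toy residual block illustrating the Mix-LN placement rule.

    Blocks with layer_idx < post_ln_layers use Post-LN
    (normalize after the residual addition); deeper blocks use
    Pre-LN (normalize the sublayer input, keep the residual
    stream unnormalized). The sublayer is a stand-in linear map.
    """

    def __init__(self, d_model: int, layer_idx: int, post_ln_layers: int):
        super().__init__()
        # Hypothetical cutoff: earlier layers get Post-LN, deeper get Pre-LN.
        self.use_post_ln = layer_idx < post_ln_layers
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = nn.Linear(d_model, d_model)  # placeholder sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.use_post_ln:
            # Post-LN: x = LN(x + F(x))
            return self.norm(x + self.sublayer(x))
        # Pre-LN: x = x + F(LN(x))
        return x + self.sublayer(self.norm(x))


# Stack 12 blocks: first 3 use Post-LN, the remaining 9 use Pre-LN
# (the 3/9 split is an arbitrary illustrative choice).
blocks = [MixLNBlock(d_model=64, layer_idx=i, post_ln_layers=3) for i in range(12)]
x = torch.randn(2, 10, 64)
for blk in blocks:
    x = blk(x)
```

The key design point is that the residual stream in the deeper (Pre-LN) layers is never normalized after the addition, which is what keeps their gradients well behaved, while the early Post-LN layers avoid Pre-LN's diminishing contribution in depth.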