Large Language Models (LLMs) have achieved remarkable success, yet recent findings reveal that their deeper layers often contribute minimally and can be pruned without affecting overall performance. While some view this as an opportunity for model compression, we identify it as a training shortfall rooted in the widespread use of Pre-Layer Normalization (Pre-LN). We demonstrate that Pre-LN, commonly employed in models like GPT and LLaMA, leads to diminished gradient norms in its deeper layers, reducing their effectiveness. In contrast, Post-Layer Normalization (Post-LN) preserves larger gradient norms in deeper layers but suffers from vanishing gradients in earlier layers. To address this, we introduce Mix-LN, a novel normalization technique that combines the strengths of Pre-LN and Post-LN within the same model. Mix-LN applies Post-LN to the earlier layers and Pre-LN to the deeper layers, ensuring more uniform gradients across layers. This allows all parts of the network, both shallow and deep layers, to contribute effectively to training. Extensive experiments across model sizes from 70M to 7B parameters demonstrate that Mix-LN consistently outperforms both Pre-LN and Post-LN, promoting more balanced, healthier gradient norms throughout the network and enhancing the overall quality of LLM pre-training. Furthermore, we demonstrate that models pre-trained with Mix-LN learn more effectively during supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) than those using Pre-LN or Post-LN, highlighting the critical importance of high-quality deep layers. By effectively addressing the inefficiencies of deep layers in current LLMs, Mix-LN unlocks their potential, enhancing model capacity without increasing model size. Our code is available at https://github.com/pixeli99/MixLN.
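The layer-wise normalization placement described above can be sketched as a toy PyTorch module. The cutoff parameter `post_ln_layers` and the single linear sublayer are illustrative assumptions only; a real Transformer block would contain attention and MLP sublayers, and the actual split between Post-LN and Pre-LN layers follows the paper's configuration, not this sketch.

```python
import torch
import torch.nn as nn


class MixLNBlock(nn.Module):
    """Toy residual block illustrating the Mix-LN placement rule.

    Blocks with layer_idx < post_ln_layers use Post-LN
    (normalize after the residual addition); deeper blocks use
    Pre-LN (normalize the sublayer input, keep the residual
    stream unnormalized). The sublayer is a stand-in linear map.
    """

    def __init__(self, d_model: int, layer_idx: int, post_ln_layers: int):
        super().__init__()
        # Hypothetical cutoff: earlier layers get Post-LN, deeper get Pre-LN.
        self.use_post_ln = layer_idx < post_ln_layers
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = nn.Linear(d_model, d_model)  # placeholder sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.use_post_ln:
            # Post-LN: x = LN(x + F(x))
            return self.norm(x + self.sublayer(x))
        # Pre-LN: x = x + F(LN(x))
        return x + self.sublayer(self.norm(x))


# Stack 12 blocks: first 3 use Post-LN, the remaining 9 use Pre-LN
# (the 3/9 split is an arbitrary illustrative choice).
blocks = [MixLNBlock(d_model=64, layer_idx=i, post_ln_layers=3) for i in range(12)]
x = torch.randn(2, 10, 64)
for blk in blocks:
    x = blk(x)
```

The key design point is that the residual stream in the deeper (Pre-LN) layers is never normalized after the addition, which is what keeps their gradients well behaved, while the early Post-LN layers avoid Pre-LN's diminishing contribution in depth.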