In this paper, we introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern Large Language Models (LLMs) that nearly half of the layers are less effective than expected. We first confirm the wide existence of this phenomenon across the most popular families of LLMs, such as Llama, Mistral, DeepSeek, and Qwen. Our analysis, both theoretical and empirical, identifies that the underlying reason for the ineffectiveness of deep layers in LLMs is the widespread use of Pre-Layer Normalization (Pre-LN). While Pre-LN stabilizes the training of Transformer LLMs, its output variance grows exponentially with model depth, which undesirably causes the derivative of deep Transformer blocks to approach an identity matrix, so that these blocks barely contribute to training. To resolve this training pitfall, we propose LayerNorm Scaling (LNS), which scales the output of layer normalization by the inverse of the square root of its depth. This simple modification mitigates the output variance explosion of deeper Transformer layers, improving their contribution. Across a wide range of model sizes (130M to 7B), our experiments show that LNS consistently outperforms previous normalization and scaling techniques in enhancing LLM pre-training performance. Moreover, this improvement carries over seamlessly to supervised fine-tuning. All these gains can be attributed to the fact that LayerNorm Scaling enables deeper layers to contribute more effectively during training. Our code is available at \href{https://github.com/lmsdss/LayerNorm-Scaling}{LayerNorm-Scaling}.
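The scaling rule described above, multiplying the layer-normalization output by the inverse square root of the layer's depth, can be sketched as a small PyTorch module. This is a minimal illustration under our own assumptions (the module name \texttt{ScaledLayerNorm} and the 1-indexed \texttt{layer\_depth} argument are illustrative, not taken from the released code):

```python
import math

import torch
import torch.nn as nn


class ScaledLayerNorm(nn.Module):
    """LayerNorm whose output is scaled by 1/sqrt(layer_depth).

    Illustrative sketch of LayerNorm Scaling (LNS): deeper layers
    receive a smaller scale, counteracting the exponential growth
    of Pre-LN output variance with depth.
    """

    def __init__(self, hidden_dim: int, layer_depth: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        # layer_depth is the 1-indexed position of this block in the stack.
        self.scale = 1.0 / math.sqrt(layer_depth)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x) * self.scale


# Example: the 4th Transformer block scales its LayerNorm output by 1/2.
x = torch.randn(2, 4, 8)
lns = ScaledLayerNorm(hidden_dim=8, layer_depth=4)
y = lns(x)
```

In a Pre-LN Transformer block, this module would simply replace the plain \texttt{nn.LayerNorm} before the attention and feed-forward sublayers; the residual stream itself is left untouched.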