In this paper, we introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern Large Language Models (LLMs) that nearly half of their layers are less effective than expected. We first confirm that this phenomenon is widespread across the most popular families of LLMs, such as Llama, Mistral, DeepSeek, and Qwen. Our theoretical and empirical analysis identifies the root cause of the ineffectiveness of deep layers in LLMs as the widespread use of Pre-Layer Normalization (Pre-LN). While Pre-LN stabilizes the training of Transformer LLMs, its output variance grows exponentially with model depth, which undesirably drives the derivative of deep Transformer blocks toward the identity matrix, so these blocks contribute little to training. To resolve this training pitfall, we propose LayerNorm Scaling (LNS), which scales the output of layer normalization by the inverse of the square root of its layer depth, thereby damping the growth of its variance. This simple modification mitigates the output-variance explosion in deeper Transformer layers, restoring their contribution. Across a wide range of model sizes (130M to 7B parameters), our experiments show that LNS consistently outperforms previous normalization and scaling techniques at improving LLM pre-training performance. Moreover, this improvement carries over seamlessly to supervised fine-tuning. All of these gains stem from the fact that LayerNorm Scaling enables deeper layers to contribute more effectively during training. Our code is available at \href{https://github.com/lmsdss/LayerNorm-Scaling}{LayerNorm-Scaling}.
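The scaling rule described above can be sketched in a few lines. The following is a minimal plain-Python illustration, not the paper's implementation: it omits the learnable affine parameters of LayerNorm, and the function names and the 1-based `layer_index` convention are our assumptions for exposition.

```python
import math

def layer_norm(x, eps=1e-5):
    # Standard LayerNorm over a single vector (no learnable gain/bias, for brevity):
    # subtract the mean and divide by the standard deviation.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def layer_norm_scaling(x, layer_index):
    # LayerNorm Scaling (LNS) sketch: divide the LayerNorm output by
    # sqrt(layer_index), which multiplies its variance by 1/layer_index.
    # Deeper layers (larger layer_index, assumed 1-based here) are damped
    # more strongly, countering the variance growth with depth under Pre-LN.
    scale = 1.0 / math.sqrt(layer_index)
    return [scale * v for v in layer_norm(x)]
```

Because dividing a zero-mean vector by a constant divides its variance by the square of that constant, the output variance of layer `l` is reduced by a factor of `l` relative to plain Pre-LN.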