The success of Large Language Models (LLMs) hinges on the stable training of deep Transformer architectures. A critical design choice is the placement of normalization layers, leading to a fundamental trade-off: the ``PreNorm'' architecture ensures training stability at the cost of potential performance degradation in deep models, while the ``PostNorm'' architecture offers strong performance but suffers from severe training instability. In this work, we propose SpanNorm, a novel technique designed to resolve this dilemma by integrating the strengths of both paradigms. Structurally, SpanNorm establishes a clean residual connection that spans the entire transformer block to stabilize signal propagation, while employing a PostNorm-style computation that normalizes the aggregated output to enhance model performance. We provide a theoretical analysis demonstrating that SpanNorm, combined with a principled scaling strategy, maintains bounded signal variance throughout the network, preventing the gradient issues that plague PostNorm models, and also alleviating the representation collapse of PreNorm. Empirically, SpanNorm consistently outperforms standard normalization schemes in both dense and Mixture-of-Experts (MoE) scenarios, paving the way for more powerful and stable Transformer architectures.
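The structural contrast between the three schemes can be made concrete with a minimal sketch. The snippet below is illustrative only: `sublayer` stands in for attention/MLP, and the internal wiring of the SpanNorm block and the scaling factor `alpha` are assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last (feature) dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer(x, W):
    # Stand-in for an attention or MLP sublayer: linear map + nonlinearity.
    return np.tanh(x @ W)

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(4, d))
W1 = rng.normal(size=(d, d)) / np.sqrt(d)
W2 = rng.normal(size=(d, d)) / np.sqrt(d)

def prenorm_block(x):
    # PreNorm: normalize each sublayer's input; the residual stream
    # itself is never normalized, so its variance grows with depth.
    x = x + sublayer(layer_norm(x), W1)
    x = x + sublayer(layer_norm(x), W2)
    return x

def postnorm_block(x):
    # PostNorm: normalize after every residual add; no clean
    # identity path survives through the block.
    x = layer_norm(x + sublayer(x, W1))
    x = layer_norm(x + sublayer(x, W2))
    return x

def spannorm_block(x, alpha=1.0):
    # SpanNorm (sketch): a single residual connection spans the whole
    # block, and the aggregated output is normalized PostNorm-style.
    f = sublayer(x, W1)
    f = f + sublayer(f, W2)           # whole-block transformation F(x)
    return layer_norm(x + alpha * f)  # block-spanning residual, then norm
```

Note how only SpanNorm keeps an unnormalized identity path across the entire block while still emitting a normalized output; `alpha` is a placeholder for the principled scaling strategy the analysis refers to.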