Large language model (LLM) scaling is hitting a wall. Widening models yields diminishing returns, and extending context length does not improve fundamental expressivity. In contrast, depth scaling offers theoretically superior expressivity, yet current Transformer architectures struggle to train reliably at extreme depths. We revisit the Post-LayerNorm (Post-LN) formulation, whose instability at scale led to its replacement by Pre-LN in modern LLMs. We show that the central failure mode of Post-LN arises from the ResNet-style residual pathway, which causes gradients to vanish in deep networks. We present Keel, a Post-LN Transformer that replaces this residual path with a Highway-style connection. This modification preserves gradient flow through the residual branch, preventing the signal from vanishing as it propagates from the top layers to the bottom. Unlike prior methods, Keel enables stable training at extreme depths without specialized initialization or complex optimization tricks. Keel trains robustly at depths exceeding 1000 layers and consistently improves perplexity and depth-scaling characteristics over Pre-LN. These findings indicate that Post-LN, when paired with a Highway-style connection, provides a simple and effective foundation for deeply scalable LLMs, opening a path toward future infinite-depth architectures.
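To make the architectural change concrete, the sketch below contrasts a standard Post-LN block, y = LN(x + F(x)), with a Highway-style Post-LN block in the spirit of Keel, y = LN(g·F(x) + (1−g)·x), where g is a learned gate. This is a minimal toy illustration in plain Python, not the authors' implementation: the sublayer, gate, and dimensionality are all hypothetical stand-ins, and it only demonstrates the block structure and forward-pass stability at depth, not the gradient analysis from the paper.

```python
import math
import random

random.seed(0)
DIM = 4  # toy hidden size, purely illustrative

def layer_norm(x, eps=1e-5):
    # Normalize to zero mean / unit variance (learned scale and shift omitted for brevity).
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def make_linear(dim):
    # Random linear map standing in for an attention or FFN sublayer.
    w = [[random.gauss(0, 0.5) for _ in range(dim)] for _ in range(dim)]
    return lambda x: [sum(w[i][j] * x[j] for j in range(dim)) for i in range(dim)]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def post_ln_resnet_block(x, sublayer):
    # Classic Post-LN with a ResNet-style residual: y = LN(x + F(x)).
    fx = sublayer(x)
    return layer_norm([xi + fi for xi, fi in zip(x, fx)])

def highway_block(x, sublayer, gate):
    # Post-LN with a Highway-style connection (Keel-like, hypothetical):
    # y = LN(g * F(x) + (1 - g) * x), with g a per-dimension learned gate.
    fx = sublayer(x)
    g = [sigmoid(v) for v in gate(x)]
    mixed = [gi * fi + (1 - gi) * xi for gi, fi, xi in zip(g, fx, x)]
    return layer_norm(mixed)

# Stack 1000 layers of each variant; both forward passes stay finite
# because the final LayerNorm renormalizes every block's output.
x = [random.gauss(0, 1) for _ in range(DIM)]
h_res, h_hwy = x[:], x[:]
for _ in range(1000):
    f = make_linear(DIM)
    gt = make_linear(DIM)
    h_res = post_ln_resnet_block(h_res, f)
    h_hwy = highway_block(h_hwy, f, gt)

print(all(math.isfinite(v) for v in h_res + h_hwy))  # → True
```

The gate g interpolates between the transformed path F(x) and the identity path x; when g is small the block passes its input through nearly unchanged, which is the mechanism the abstract credits with keeping gradients alive across very deep stacks.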