Residual architectures are ubiquitous in deep learning, but they suffer from a subtle structural limitation: the norm of the residual stream can grow rapidly with depth. As a result, updates from later layers become small relative to the accumulated residual state. This reduces their impact on the representation and limits the benefits of scaling models in depth. To address this, we introduce NAG, a norm-agnostic residual architecture that separates magnitude from directional information in the residual stream, preserving meaningful layer contributions throughout depth and preventing later updates from being systematically suppressed by residual-norm growth. Importantly, NAG introduces only a negligible number of additional parameters and relies on simple operations that are easily kernel-fusible, preserving training efficiency in practice. We show that this architecture outperforms baseline Transformers, with gains that increase substantially as depth grows, enabling effective training of much deeper models. The norm-agnostic formulation also leads to an interpretable Mixture-of-Depths (MoD) mechanism that adaptively skips both attention and MLP layers. Beyond serving as a post-training accuracy-compute tradeoff, this mechanism can be used as a pretraining-time scaling strategy: under iso-FLOP training, compute saved by reducing per-token forward-pass cost can be reinvested into training on more tokens while keeping the total parameter count and KV-cache budget fixed. In our experiments, moderate Mixture-of-Depths rates of approximately 20%-25% match full-depth baseline performance under equal training compute while substantially reducing the number of executed layer parameters and forward-pass FLOPs. These results identify sparsity in depth as a new scaling axis for fixed-compute training, enabling very deep yet FLOP-efficient models.
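Two of the quantitative claims above can be made concrete with a small toy sketch. It is not the NAG architecture and it does not reproduce any experiment. The first function assumes a plain residual stream in which every layer adds a random unit-norm update, and it reports how large each update is relative to the accumulated state. The second assumes that per-token training cost scales linearly with the fraction of executed layers (router overhead, embeddings, and the output head are ignored) and converts a skip rate into the extra tokens affordable at fixed compute. The names `relative_update_norms` and `isoflop_token_multiplier` are hypothetical, chosen only for illustration.

```python
import numpy as np


def relative_update_norms(depth: int, width: int, seed: int = 0) -> np.ndarray:
    """Toy residual stream x <- x + u, where each u is a random unit vector.

    Returns ||u|| / ||x|| for every layer, with x taken *before* that
    layer's update is added. This is an illustrative assumption, not a
    model of any trained network.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    x /= np.linalg.norm(x)  # start from a unit-norm stream
    ratios = np.empty(depth)
    for layer in range(depth):
        u = rng.standard_normal(width)
        u /= np.linalg.norm(u)  # every layer writes a unit-norm update
        ratios[layer] = 1.0 / np.linalg.norm(x)  # ||u|| is 1 by construction
        x = x + u  # standard residual accumulation
    return ratios


def isoflop_token_multiplier(skip_rate: float) -> float:
    """Token multiplier at fixed training compute when a fraction
    `skip_rate` of layer executions is skipped.

    Assumes cost per token is proportional to (1 - skip_rate).
    """
    if not 0.0 <= skip_rate < 1.0:
        raise ValueError("skip_rate must be in [0, 1)")
    return 1.0 / (1.0 - skip_rate)


if __name__ == "__main__":
    ratios = relative_update_norms(depth=64, width=512)
    print("relative update, first layer:", ratios[0])
    print("relative update, last layer: ", ratios[-1])
    for rate in (0.20, 0.25):
        print(f"skip rate {rate:.2f} -> token multiplier",
              isoflop_token_multiplier(rate))
```

In this toy setting the updates are nearly orthogonal, so the stream norm grows roughly like the square root of the layer index and the relative size of each update shrinks accordingly. Under the linear-cost assumption, a 20% skip rate corresponds to 1.25 times as many training tokens at equal compute, and a 25% skip rate to about 1.33 times.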