Normalization is widely viewed as essential for stabilizing Transformer training. We revisit this assumption for pre-norm Transformers and ask to what extent sample-dependent normalization is needed inside Transformer blocks. We introduce TaperNorm, a drop-in replacement for RMSNorm/LayerNorm that behaves exactly like the standard normalizer early in training and then smoothly tapers to a learned sample-independent linear/affine map. A single global gate is held at $g{=}1$ during gate warmup, used to calibrate the scaling branch via EMAs, and then cosine-decayed to $g{=}0$, at which point per-token statistics vanish and the resulting fixed scalings can be folded into adjacent linear projections. Our theoretical and empirical results isolate scale anchoring as the key role played by output normalization: as a (near) $0$-homogeneous map it removes radial gradients at the output, whereas without such an anchor cross-entropy encourages unbounded logit growth (``logit chasing''). We further show that a simple fixed-target auxiliary loss on the pre-logit residual-stream scale provides an explicit alternative anchor and can aid removal of the final normalization layer. Empirically, TaperNorm matches normalized baselines under identical setups while eliminating per-token statistics and enabling these layers to be folded into adjacent linear projections at inference. On an efficiency microbenchmark, folding internal scalings yields up to $1.22\times$ higher throughput in last-token logits mode. These results take a step towards norm-free Transformers while identifying the special role output normalization plays.
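The gate mechanism described above can be sketched as a minimal NumPy implementation. This is an illustrative reconstruction, not the paper's code: the class name `TaperNorm` comes from the abstract, but the exact EMA calibration rule, the per-channel vs. scalar form of the fixed scaling, and the schedule parameters are assumptions made for the sketch.

```python
import numpy as np

class TaperNorm:
    """Sketch of TaperNorm: a global gate g interpolates between RMSNorm
    (g = 1) and a fixed, sample-independent scaling (g = 0). The EMA
    calibration rule here is an assumption, not the paper's exact recipe."""

    def __init__(self, dim, eps=1e-6, ema_decay=0.99):
        self.weight = np.ones(dim)   # learned gain, as in RMSNorm
        self.eps = eps
        self.ema_decay = ema_decay
        self.inv_rms_ema = 1.0       # EMA estimate of 1/rms; calibrates the scaling branch
        self.g = 1.0                 # global gate: 1 = pure RMSNorm behavior

    def set_gate(self, step, warmup_steps, decay_steps):
        """Hold g = 1 during gate warmup, then cosine-decay it to 0."""
        if step < warmup_steps:
            self.g = 1.0
        else:
            t = min((step - warmup_steps) / decay_steps, 1.0)
            self.g = 0.5 * (1.0 + np.cos(np.pi * t))

    def __call__(self, x, training=True):
        # Per-token RMS statistic (the part that vanishes at g = 0).
        rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + self.eps)
        if training and self.g == 1.0:
            # During warmup, calibrate a sample-independent 1/rms estimate.
            self.inv_rms_ema = (self.ema_decay * self.inv_rms_ema
                                + (1.0 - self.ema_decay) * float(np.mean(1.0 / rms)))
        normed = x / rms                  # sample-dependent branch
        fixed = x * self.inv_rms_ema     # sample-independent linear branch
        return self.weight * (self.g * normed + (1.0 - self.g) * fixed)
```

At $g{=}0$ the layer reduces to an elementwise multiplication by `weight * inv_rms_ema`, i.e. a fixed diagonal scaling that can be folded into the adjacent linear projection at inference, which is what enables the throughput gain the abstract reports.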