LayerNorm and RMSNorm impose fundamentally different geometric constraints on their outputs, and this difference has a precise, quantifiable consequence for model complexity. We prove that LayerNorm's mean-centering step, by confining data to a linear hyperplane (one passing through the origin), reduces the Local Learning Coefficient (LLC) of the subsequent weight matrix by exactly $m/2$, where $m$ is that matrix's output dimension; RMSNorm's projection onto a sphere preserves the LLC entirely. This reduction is structurally guaranteed before any training begins and is determined by the geometry of the data manifold alone. The underlying condition is a geometric threshold: for the codimension-one manifolds we study, the LLC drop is binary. Any non-zero curvature, regardless of sign or magnitude, is sufficient to preserve the LLC, while only affinely flat manifolds cause the drop. At finite sample sizes this threshold becomes a smooth crossover whose width depends on how much of the data distribution actually experiences the curvature, not merely on whether curvature exists somewhere. We verify both predictions with controlled single-layer scaling experiments using the wrLLC framework. We further show that Softmax simplex data introduces a "smuggled bias" that triggers the same $m/2$ LLC drop when paired with an explicit downstream bias; we prove this via the affine symmetry extension of the main theorem and confirm it empirically.
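The geometric claims above can be illustrated numerically. The following is a minimal NumPy sketch, not the paper's wrLLC procedure and not an LLC estimate: it only checks that LayerNorm outputs span an $(n-1)$-dimensional linear subspace, that RMSNorm outputs span all of $\mathbb{R}^n$, and that shifting each row of a downstream weight matrix by a constant leaves the layer's output unchanged on LayerNorm data (and on Softmax data when a bias absorbs the shift). The dimensions, the Gaussian inputs, the normalizations without learned gain or bias, and all function names are assumptions made for illustration. The resulting $m$-parameter family of equivalent weights is consistent with the stated $m/2$ drop but does not prove it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, N = 8, 3, 1000  # input dim, output dim, sample count (illustrative choices)

def layer_norm(x, eps=1e-5):
    # LayerNorm without learned gain/bias: center, then scale by the std.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # RMSNorm without learned gain: scale by the root mean square, no centering.
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def data_rank(Z, tol=1e-8):
    # Number of singular values above a relative tolerance.
    s = np.linalg.svd(Z, compute_uv=False)
    return int((s > tol * s[0]).sum())

X = rng.normal(size=(N, n))
ln, rms, sm = layer_norm(X), rms_norm(X), softmax(X)

rank_ln = data_rank(ln)    # hyperplane through the origin: one direction missing
rank_rms = data_rank(rms)  # sphere: the data still spans the full space
# Simplex data plus a constant-1 column (the bias input) is rank deficient,
# because the constant column equals the sum of the softmax coordinates.
rank_sm_aug = data_rank(np.hstack([sm, np.ones((N, 1))]))

W = rng.normal(size=(m, n))
b = rng.normal(size=m)
c = rng.normal(size=(m, 1))
W_shift = W + c @ np.ones((1, n))  # add c_i to every entry of row i

# LayerNorm data: the shift is invisible, giving m free directions in W.
ln_gap = np.abs(ln @ W.T - ln @ W_shift.T).max()
# RMSNorm data: the same shift changes the output.
rms_gap = np.abs(rms @ W.T - rms @ W_shift.T).max()
# Softmax data: the shift is invisible once the bias absorbs it.
sm_gap = np.abs((sm @ W.T + b) - (sm @ W_shift.T + (b - c[:, 0]))).max()

print(rank_ln, rank_rms, rank_sm_aug)
print(ln_gap, rms_gap, sm_gap)
```

In this sketch the shift vector `c` has $m$ free entries, one per output row, which matches the count of flat parameter directions behind the $m/2$ figure in the abstract. On RMSNorm data no such direction exists, because the sphere is not contained in any hyperplane.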