Robust Layerwise Scaling Rules by Proper Weight Decay Tuning

Empirical scaling laws prescribe how to allocate parameters, data, and compute, while maximal-update parameterization ($\mu$P) enables learning-rate transfer across widths by equalizing early-time update magnitudes. However, in modern scale-invariant architectures, training quickly enters an optimizer-governed steady state where normalization layers create backward scale sensitivity and the effective learning rate becomes width dependent, degrading $\mu$P transfer. We address this by introducing a weight-decay scaling rule for AdamW that preserves sublayer gain across widths. Empirically, the singular-value spectrum of each matrix parameter scales in norm as $\sqrt{\eta/\lambda}$ with an approximately invariant shape; under width scaling $d$, we observe that the top singular value scales approximately as $\sqrt{\eta/\lambda}\cdot d^{0.75}$. Combining this observation with the $\mu$P learning-rate rule $\eta_2\propto d^{-1}$ for matrix-like parameters implies an empirical weight-decay scaling rule $\lambda_2\propto \sqrt{d}$ that approximately keeps sublayer gains width invariant. Together with vector-like parameters trained at $\eta_1=\Theta_d(1)$ and $\lambda_1=0$, this yields \emph{zero-shot} transfer of both learning rate and weight decay from proxy to target widths, removing per-width sweeps. We validate the rule on LLaMA-style Transformers and in a minimal synthetic setting, and we provide a simple diagnostic, matching top singular values, to check sublayer-gain invariance. Our results extend $\mu$P beyond the near-init regime by explicitly controlling steady-state scales set by the optimizer, offering a practical recipe for width-robust hyperparameter transfer under AdamW.
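The arithmetic behind the rule can be checked directly: substituting $\eta_2\propto d^{-1}$ and $\lambda_2\propto\sqrt{d}$ into the observed scaling $\sqrt{\eta/\lambda}\cdot d^{0.75}$ gives $\sqrt{d^{-1}/d^{0.5}}\cdot d^{0.75}=d^{0}$, so the predicted top singular value does not change with width. The following is a minimal sketch of that transfer, not the authors' implementation. The function names, the constant `c`, and the example proxy values are hypothetical and chosen only for illustration.

```python
import math


def transfer_matrix_hparams(eta_proxy, lam_proxy, d_proxy, d_target):
    """Scale matrix-like AdamW hyperparameters from a proxy width to a target width.

    Hypothetical helper: applies eta_2 ∝ d^{-1} (the muP learning-rate rule)
    and lambda_2 ∝ sqrt(d) (the empirical weight-decay rule from the abstract).
    """
    ratio = d_target / d_proxy
    eta_target = eta_proxy / ratio            # learning rate shrinks linearly in width
    lam_target = lam_proxy * math.sqrt(ratio)  # weight decay grows as sqrt(width)
    return eta_target, lam_target


def transfer_vector_hparams(eta_proxy):
    """Vector-like parameters keep eta_1 = Theta_d(1) and use lambda_1 = 0."""
    return eta_proxy, 0.0


def predicted_top_singular_value(eta, lam, d, c=1.0):
    """Empirical scaling sqrt(eta / lambda) * d^0.75, up to an assumed constant c."""
    return c * math.sqrt(eta / lam) * d ** 0.75


# Illustrative values only (not taken from the paper).
eta_p, lam_p, d_p, d_t = 1e-2, 0.1, 256, 1024
eta_t, lam_t = transfer_matrix_hparams(eta_p, lam_p, d_p, d_t)

sigma_proxy = predicted_top_singular_value(eta_p, lam_p, d_p)
sigma_target = predicted_top_singular_value(eta_t, lam_t, d_t)
print(eta_t, lam_t, sigma_proxy, sigma_target)
```

Under these assumptions the two predicted top singular values coincide, which mirrors the matching-top-singular-values diagnostic described above. In an actual run the comparison would use singular values measured from trained weight matrices rather than the closed-form prediction.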
