A central question in modern deep learning is how to design optimizers whose behavior remains stable as the network width $w$ increases. We address this question by interpreting several widely used neural-network optimizers, including \textrm{AdamW} and \textrm{Muon}, as instances of steepest descent under matrix operator norms. This perspective links optimizer geometry with the Lipschitz structure of the network forward map, and enables width-independent control of both Lipschitz and smoothness constants. However, steepest-descent rules induced by standard $p \to q$ operator norms lack layerwise composability and therefore cannot provide width-independent bounds in deep architectures. We overcome this limitation by introducing a family of mean-normalized operator norms, denoted $\pmean \to \qmean$, that admit layerwise composability, yield width-independent smoothness bounds, and give rise to practical optimizers such as \emph{rescaled} \textrm{AdamW}, row normalization, and column normalization. The resulting width-aware learning-rate scaling rules recover $\mu$P scaling~\cite{yang2021tensor} as a special case and provide a principled mechanism for cross-width learning-rate transfer across a broad class of optimizers. We further show that \textrm{Muon} can suffer an $\mathcal{O}(\sqrt{w})$ worst-case growth in its smoothness constant, whereas a new family of row-normalized optimizers we propose achieves width-independent smoothness guarantees. Building on these observations, we propose MOGA (Matrix Operator Geometry Aware), a width-aware optimizer based solely on row/column-wise normalization that enables stable learning-rate transfer across model widths. Large-scale pre-training on GPT-2 and LLaMA shows that MOGA, especially with row normalization, is competitive with Muon while being notably faster in large-token and low-loss regimes.
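To make the two mechanisms named above concrete, the following is a minimal sketch of a row-normalized update and a width-aware learning-rate transfer rule. This is an illustration only, not the paper's implementation: the function names are hypothetical, and the $1/w$ scaling shown is the exponent that matches $\mu$P for hidden layers; the exact rule depends on the chosen operator norm.

```python
import numpy as np

def row_normalized_update(grad, eps=1e-8):
    """Hypothetical sketch of row normalization: rescale each row of the
    gradient matrix to unit Euclidean norm, so the update size per output
    neuron is independent of the fan-in (width)."""
    row_norms = np.linalg.norm(grad, axis=1, keepdims=True)
    return grad / (row_norms + eps)

def width_aware_lr(base_lr, base_width, width):
    """Illustrative cross-width learning-rate transfer: scale the base
    learning rate inversely with width, mirroring muP-style scaling for
    hidden layers (assumed exponent; norm-dependent in general)."""
    return base_lr * base_width / width

# Example: a 4x8 gradient matrix; after row normalization every row has
# unit norm, so the per-row update magnitude no longer grows with width.
rng = np.random.default_rng(0)
G = rng.standard_normal((4, 8))
U = row_normalized_update(G)
print(np.linalg.norm(U, axis=1))  # each entry close to 1.0

# Transferring a learning rate tuned at width 256 to width 1024:
print(width_aware_lr(1e-3, base_width=256, width=1024))  # 2.5e-04
```

The design point the sketch illustrates is that normalization is applied per row (or per column) rather than to the whole matrix via an SVD-based map as in Muon, which is what allows the width-independent guarantees claimed above.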