Orthogonalized-update optimizers such as Muon improve training of matrix-valued parameters, but existing extensions mostly act either after orthogonalization by rescaling updates or before it with heavier whitening-based preconditioners. We introduce {\method}, a lightweight family of pre-orthogonalization equilibration schemes for Muon in three forms: two-sided row/column normalization (RC), row normalization (R), and column normalization (C). These variants rebalance the momentum matrix before finite-step Newton--Schulz using row/column squared-norm statistics and only $\mathcal{O}(m+n)$ auxiliary state. We show that finite-step orthogonalization is governed by input spectral properties, especially stable rank and condition number, and that row/column normalization is a zeroth-order whitening surrogate that removes marginal scale mismatch. For the hidden matrix weights targeted by {\method}, the row-normalized variant R is the natural default and preserves the $\widetilde{\mathcal{O}}(T^{-1/4})$ stationarity guarantee of Muon-type methods. In LLaMA2 pretraining on C4, the default R variant consistently outperforms Muon on 130M and 350M models, yielding faster convergence and lower validation perplexity.
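To make the pre-orthogonalization step concrete, the following is a minimal sketch and not the authors' implementation. It assumes the R variant simply divides each row of the momentum matrix by its current L2 norm. The abstract only says that row/column squared-norm statistics are used, so the exact statistic, any running average, and the $\mathcal{O}(m+n)$ auxiliary state are not modeled here. It also swaps in the classic cubic Newton--Schulz iteration $X \leftarrow 1.5X - 0.5\,XX^\top X$ rather than Muon's tuned polynomial. All function names and the toy `momentum` matrix are hypothetical.

```python
import math

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def row_normalize(M, eps=1e-12):
    """Assumed form of the R variant: divide each row by its L2 norm."""
    out = []
    for row in M:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / (norm + eps) for x in row])
    return out

def newton_schulz(M, steps=5, eps=1e-12):
    """Finite-step cubic Newton-Schulz (not Muon's tuned coefficients)."""
    # Frobenius scaling puts every singular value in (0, 1],
    # inside the convergence region of the cubic iteration.
    fro = math.sqrt(sum(x * x for row in M for x in row))
    X = [[x / (fro + eps) for x in row] for row in M]
    for _ in range(steps):
        A = matmul(X, transpose(X))   # Gram matrix X X^T
        AX = matmul(A, X)             # (X X^T) X
        X = [[1.5 * x - 0.5 * ax for x, ax in zip(rx, rax)]
             for rx, rax in zip(X, AX)]
    return X

def orthogonality_error(X):
    """Frobenius norm of X X^T - I (zero when the rows are orthonormal)."""
    G = matmul(X, transpose(X))
    n = len(G)
    return math.sqrt(sum((G[i][j] - (1.0 if i == j else 0.0)) ** 2
                         for i in range(n) for j in range(n)))

# Toy momentum matrix: orthogonal rows with a 100x scale mismatch.
momentum = [[30.0, 40.0, 0.0], [0.0, 0.0, 0.5]]

err_muon = orthogonality_error(newton_schulz(momentum))
err_row = orthogonality_error(newton_schulz(row_normalize(momentum)))
```

The toy matrix is deliberately an extreme case: its rows are orthogonal and differ only in scale, so the scale mismatch is the sole source of ill-conditioning. With five steps, the raw input leaves a large orthogonality error because its small singular value grows by only about a factor of 1.5 per step, while the row-normalized input is orthogonal to numerical precision. Real momentum matrices will sit between these two extremes.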