Muon is an increasingly popular optimizer that replaces a gradient $G=USV^\top$ with its polar factor $UV^\top$, thereby flattening the singular spectrum. However, full flattening discards singular-value information that may matter for adaptation. We introduce Muon$^p$, a Muon-style optimizer that instead uses fractional spectral-power updates $US^pV^\top$ for rational $p\in(0,1)$, interpolating between Muon ($p\to 0$) and gradient descent ($p=1$). To make it practical, we prove that fractional spectral powers cannot be computed by any fixed univariate polynomial iteration, and we derive low-degree odd bivariate recurrences that approximate $US^pV^\top$ using only matrix multiplications, preserving Muon's matrix-multiplication-only structure and computational complexity. We show that the Muon$^p$ update maximizes the linear (first-order) improvement in loss under a Schatten $q$-norm constraint with $q=1+\frac{1}{p}$. Empirically, Muon$^p$ is especially effective for finetuning: on billion-scale models, it improves validation perplexity and downstream task performance. We further analyze, through the lens of spectral geometry, when Muon$^p$ is less suitable. Our results clarify when preserving the singular spectrum brings significant gains and provide a principled way to achieve them.
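As a rough illustration of the update $US^pV^\top$ and of the relation $q=1+\frac{1}{p}$, the following NumPy sketch computes the fractional spectral power through an explicit SVD and checks numerically that, once normalized to unit Schatten $q$-norm, its inner product with $G$ equals the dual Schatten norm of $G$ with exponent $1+p$. This is only a reference computation under stated assumptions: it is not the matrix-multiplication-only bivariate recurrence described above, and the function names `spectral_power` and `schatten_norm`, the matrix shape, and the choice $p=0.5$ are hypothetical and chosen for illustration.

```python
import numpy as np

def spectral_power(G, p):
    """Reference computation of U S^p V^T via an explicit SVD.

    Illustrative only: this is NOT the matmul-only recurrence from the text.
    """
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    # Scale each column of U by the corresponding singular value raised to p.
    return (U * s**p) @ Vt

def schatten_norm(X, q):
    """Schatten q-norm: the l_q norm of the singular values of X."""
    s = np.linalg.svd(X, compute_uv=False)
    return float(np.sum(s**q) ** (1.0 / q))

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))   # stand-in for a gradient matrix (assumption)

p = 0.5                           # illustrative exponent in (0, 1)
q = 1.0 + 1.0 / p                 # Schatten exponent paired with p
q_dual = 1.0 + p                  # Hoelder conjugate of q: 1/q + 1/q_dual = 1

D = spectral_power(G, p)
D_unit = D / schatten_norm(D, q)  # rescale to unit Schatten q-norm

# First-order improvement <G, D_unit>; by Hoelder duality its maximum over the
# unit Schatten q-ball is the Schatten q_dual-norm of G.
linear_gain = float(np.sum(G * D_unit))
dual_norm = schatten_norm(G, q_dual)
print(linear_gain, dual_norm)
```

The two printed values should agree up to floating-point error, which is the numerical counterpart of the steepest-descent statement. Setting `p=1.0` in `spectral_power` returns `G` itself (gradient descent), while `p=0.0` returns the polar factor $UV^\top$ for a full-rank `G` (Muon); both endpoints lie outside the open interval $(0,1)$ and are shown only as limiting cases.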