Muon replaces a matrix gradient $G=U\Sigma V^\top$ by its polar factor $UV^\top$. This keeps the singular directions selected by the gradient, but makes the update spectrum flat. We study the optimization bias created by this operation. Under explicit alignment assumptions, we prove that the polar update is the one-step entropy-maximizing choice among bounded updates that use the gradient singular directions and do not adapt to the current weight spectrum. In an underdetermined regression model, we derive exact singular-value dynamics for continuous-time Muon and identify a measurement-dependent condition under which the normalized spectrum moves toward equal nonzero singular values. This geometry also rules out a common low-rank interpretation: at fixed Frobenius norm, Muon's distinguished state has a flat spectrum, whereas nuclear-norm minimization favors spectral concentration. Controlled matrix-sensing experiments separate the effect from simple gradient rescaling, show that norm-matched gradient descent does not reproduce Muon, and recover the predicted flattening trend across broad ablations. In small NanoGPT pretraining, Muon preserves stable rank, has a broad learning-rate plateau, and improves validation loss relative to AdamW; in a matched small-ViT control, the ranking reverses. The resulting picture is regime-dependent: Muon is not universally superior, but its flat-spectrum bias can help when many spectral directions need to remain active.
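As a small illustration of the opening claim, the sketch below computes the polar factor $UV^\top$ of a random matrix and compares its singular values with those of the original gradient. It is a minimal sketch, not the paper's implementation: the helper names `polar_factor` and `spectral_entropy`, the matrix shape, and the random seed are all assumptions introduced here, and an exact SVD is used for clarity rather than any approximation a practical optimizer might prefer. The entropy is taken over singular values normalized to sum to one, which is one plausible reading of "flat spectrum" and may differ from the paper's exact definition.

```python
import numpy as np


def polar_factor(G):
    """Return U @ V^T from the thin SVD G = U diag(S) V^T (hypothetical helper)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt


def spectral_entropy(s, eps=1e-12):
    """Shannon entropy of singular values normalized to sum to one (assumed definition)."""
    p = s / s.sum()
    return float(-(p * np.log(p + eps)).sum())


# Illustrative gradient: shape and seed are arbitrary assumptions.
rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))

P = polar_factor(G)

# Singular values of the raw gradient and of its polar factor.
s_G = np.linalg.svd(G, compute_uv=False)
s_P = np.linalg.svd(P, compute_uv=False)

print("gradient singular values:    ", np.round(s_G, 3))
print("polar-factor singular values:", np.round(s_P, 3))
print("entropy of gradient spectrum:", spectral_entropy(s_G))
print("entropy of polar spectrum:   ", spectral_entropy(s_P))
```

For a full-rank $6\times 4$ input, all four singular values of the polar factor equal one up to floating-point error, so its normalized spectrum is uniform and the entropy reaches its maximum $\log 4$, while the singular vectors are those of $G$.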