The Muon optimizer, a matrix-structured algorithm that leverages spectral orthogonalization of gradients, is a milestone in the pretraining of large language models. However, the mechanisms underlying Muon, particularly the role of gradient orthogonalization, remain poorly understood: very few works provide end-to-end analyses that rigorously explain its advantages in concrete applications. We take a step toward closing this gap by studying the effectiveness of a simplified variant of Muon through two case studies: matrix factorization and in-context learning of linear transformers. For both problems, we prove that simplified Muon converges linearly with an iteration complexity independent of the relevant condition number, provably outperforming gradient descent and Adam. Our analysis reveals that the Muon dynamics decouple into a collection of independent scalar sequences in the spectral domain, each exhibiting similar convergence behavior. Our theory formalizes the preconditioning effect induced by spectral orthogonalization, offering insight into Muon's effectiveness on these matrix optimization problems and potentially beyond.
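To make the mechanism concrete, the sketch below applies the simplified, full-SVD Muon update to an illustrative least-squares objective; the objective, dimensions, and step size are our own assumptions rather than details from the paper, and practical Muon approximates the SVD with Newton-Schulz iterations and adds momentum.

```python
# Minimal sketch (illustrative, not the paper's code) of the simplified Muon
# update: replace the gradient G = U diag(s) V^T with its spectral
# orthogonalization U V^T before taking a descent step.
import numpy as np

def simplified_muon_step(W, grad, lr):
    """One step of simplified Muon: W <- W - lr * U V^T, where grad = U S V^T."""
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    return W - lr * (U @ Vt)  # every singular value of grad is rescaled to 1

# Illustrative objective f(W) = 0.5 * ||W - M||_F^2 (a hypothetical stand-in
# for the paper's matrix-factorization case study); its gradient is W - M.
rng = np.random.default_rng(0)
U0, _ = np.linalg.qr(rng.standard_normal((8, 4)))
V0, _ = np.linalg.qr(rng.standard_normal((4, 4)))
M = U0 @ np.diag([10.0, 1.0, 0.1, 0.01]) @ V0.T  # condition number 1000

W = np.zeros_like(M)
for _ in range(120):
    W = simplified_muon_step(W, W - M, lr=0.1)
print(np.linalg.norm(W - M, "fro"))  # residual settles at the O(lr) scale
```

On this objective the residual R = W - M shares singular vectors with the orthogonalized gradient, so each step maps every singular value s_i of R to |s_i - lr|: the dynamics decouple into independent scalar sequences, and all directions reach the O(lr) floor within about s_max / lr steps regardless of the condition number, whereas gradient descent slows down on the small-singular-value directions.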