Most parameters in neural networks are naturally represented as matrices. However, most commonly used optimizers treat these matrix parameters as flattened vectors during optimization, potentially overlooking their inherent structure. Muon is a recently proposed optimizer designed specifically for matrix-structured parameters. Extensive empirical evidence shows that Muon can significantly outperform traditional optimizers when training neural networks. Nonetheless, the theoretical understanding of Muon's convergence behavior, and of the reasons behind its superior performance, remains limited. In this work, we present a comprehensive convergence rate analysis of Muon and compare it with Gradient Descent (GD). We characterize the conditions under which Muon can outperform GD. Our theoretical results reveal that Muon can benefit from the low-rank structure of Hessian matrices, a phenomenon widely observed in practical neural network training. Our experimental results corroborate the theoretical findings.
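To make the contrast between a matrix-aware update and a vector-style update concrete, the following is a minimal sketch rather than the algorithm as analyzed in this work. It assumes the commonly described form of Muon, in which the momentum matrix is replaced by its orthogonal polar factor before the step is taken. An exact SVD is used here for clarity, whereas practical implementations typically approximate this factor with a Newton-Schulz iteration. The function names, the learning rate, and the momentum coefficient are illustrative assumptions.

```python
import numpy as np


def orthogonalize(m):
    """Return U @ V^T from the thin SVD of m (its orthogonal polar factor).

    Assumption: exact SVD stands in for the iterative approximation
    that practical Muon implementations typically use.
    """
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def muon_like_step(w, grad, momentum, lr=0.1, beta=0.9):
    """One Muon-style update on a matrix parameter (illustrative sketch)."""
    # Accumulate momentum in matrix form, without flattening.
    momentum = beta * momentum + grad
    # Step along the orthogonalized momentum: every singular direction
    # of the momentum matrix receives a step of equal magnitude.
    w = w - lr * orthogonalize(momentum)
    return w, momentum


def gd_step(w, grad, lr=0.1):
    """Plain gradient descent: the matrix is treated entry by entry."""
    return w - lr * grad
```

For a gradient such as `np.diag([3.0, 0.5])`, `gd_step` moves six times farther along the first singular direction than along the second, while `muon_like_step` moves the same distance along both. This is the kind of structural difference that a vector-based view of the parameters cannot express.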