Spectral optimizers such as Muon have recently shown strong empirical performance in large-scale language model training, but the source and extent of their advantage remain poorly understood. We study this question through the linear associative memory problem, a tractable model for factual recall in transformer-based models. In particular, we go beyond orthogonal embeddings and consider Gaussian inputs and outputs, which allows the number of stored associations to greatly exceed the embedding dimension. Our main result sharply characterizes the recovery rates of one step of Muon, SGD, and Newton's method on the logistic regression loss under a power-law frequency distribution. We show that the storage capacity of Muon significantly exceeds that of SGD, and even matches that of Newton's method while using only first-order information. Moreover, Muon saturates at a larger critical batch size. We further analyze the multi-step dynamics under a thresholded gradient approximation and show that Muon achieves a substantially faster initial recovery rate than SGD, while both methods eventually converge to the information-theoretic limit at comparable speeds. Experiments on synthetic tasks validate the predicted scaling laws. Our analysis provides a quantitative understanding of the signal amplification of spectral preconditioners and lays the groundwork for establishing scaling laws across more practical language modeling tasks and optimizers.
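To make the one-step comparison concrete, the following is a minimal sketch and not the construction analyzed in the paper. It assumes a square weight matrix initialized at zero, Gaussian input and output embeddings with variance 1/d, a softmax cross-entropy loss over all stored outputs, a full-batch (population) gradient, and a power-law exponent `alpha`. The Muon step is idealized as the exact polar factor of the negative gradient computed by SVD, whereas practical Muon implementations approximate this with Newton-Schulz iterations applied to a momentum buffer. The function name `one_step_recovery` and all parameter values are hypothetical.

```python
import numpy as np

def one_step_recovery(N=512, d=32, alpha=1.5, seed=0):
    """Compare which associations one SGD step and one idealized Muon step recover."""
    rng = np.random.default_rng(seed)
    # Assumed Gaussian embeddings: row i holds u_i (input) and v_i (output).
    U_in = rng.standard_normal((N, d)) / np.sqrt(d)
    V_out = rng.standard_normal((N, d)) / np.sqrt(d)

    # Assumed power-law frequencies p_i proportional to i^(-alpha).
    p = np.arange(1, N + 1, dtype=float) ** (-alpha)
    p /= p.sum()

    # At W = 0 the softmax is uniform, so the population gradient is
    # G = -sum_i p_i (v_i - v_bar) u_i^T, with v_bar the mean output embedding.
    v_bar = V_out.mean(axis=0)
    G = -((V_out - v_bar) * p[:, None]).T @ U_in  # shape (d, d)

    # SGD step from zero: W is proportional to -G (the step size does not
    # change the argmax used for recovery, so it is set to 1).
    W_sgd = -G

    # Idealized Muon step: replace -G by its polar factor L @ Rt, which sets
    # every singular value to 1.
    L, _, Rt = np.linalg.svd(-G, full_matrices=False)
    W_muon = L @ Rt

    def recovered(W):
        # logits[i, j] = v_j^T W u_i; item i is recovered if its own output wins.
        logits = U_in @ W.T @ V_out.T
        return logits.argmax(axis=1) == np.arange(N)

    return recovered(W_sgd), recovered(W_muon), W_muon

if __name__ == "__main__":
    rec_sgd, rec_muon, _ = one_step_recovery()
    print("recovered by SGD :", int(rec_sgd.sum()))
    print("recovered by Muon:", int(rec_muon.sum()))
```

Because the SGD update weights each association by its frequency, rare items contribute little signal relative to the interference from frequent ones; flattening the singular values is one way to picture the signal amplification attributed to spectral preconditioners. The counts printed by this sketch depend on the assumed `N`, `d`, `alpha`, and seed, and are not a substitute for the rates derived in the paper.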