While adaptive gradient methods are the workhorses of modern machine learning, sign-based optimization algorithms such as Lion and Muon have recently demonstrated superior empirical performance over AdamW in training large language models (LLMs). However, a theoretical understanding of why sign-based updates outperform variance-adapted methods remains elusive. In this paper, we aim to bridge this gap between theory and practice through the lens of heavy-tailed gradient noise, a phenomenon frequently observed in language-modeling tasks. Theoretically, we introduce a novel generalized heavy-tailed noise condition that captures the gradient-noise behavior of LLM training more accurately than standard finite-variance assumptions. Under this noise model, we establish sharp convergence rates for SignSGD and Lion on generalized smooth function classes, matching or surpassing the previous best-known bounds. Furthermore, we extend our analysis to Muon and Muonlight, providing, to our knowledge, the first rigorous analysis of matrix optimization under heavy-tailed stochasticity. These results offer a strong theoretical justification for the empirical superiority of sign-based optimizers, showing that they are naturally suited to the heavy-tailed gradient noise encountered in practice. Empirically, LLM pretraining experiments validate our theoretical insights and confirm that our proposed noise models align well with practice.
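For orientation, the classical heavy-tailed noise condition from the stochastic-optimization literature bounds only a $p$-th central moment of the stochastic gradient $g(x)$ rather than its variance; the generalized condition introduced in this paper relaxes this standard assumption, which we state here only as a reference point:

\[
  \mathbb{E}\big[\,\|g(x) - \nabla f(x)\|^{p}\,\big] \;\le\; \sigma^{p},
  \qquad p \in (1, 2].
\]

For $p = 2$ this recovers the usual bounded-variance assumption, while for $p < 2$ the noise may have infinite variance, as is frequently reported for gradient noise in LLM training.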
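To make the sign-based updates concrete, the following is a minimal sketch of the SignSGD and Lion steps, assuming their standard formulations from the literature (SignSGD: Bernstein et al., 2018; Lion: Chen et al., 2023); it is illustrative and not the paper's exact algorithm statements.

```python
import numpy as np

def signsgd_step(x, grad, lr):
    """SignSGD: step along the coordinate-wise sign of the gradient,
    discarding its magnitude (and hence any heavy-tailed outliers)."""
    return x - lr * np.sign(grad)

def lion_step(x, m, grad, lr, beta1=0.9, beta2=0.99, wd=0.0):
    """Lion: sign of an interpolated momentum, with decoupled weight decay.
    Returns the updated parameters and momentum buffer."""
    update = np.sign(beta1 * m + (1.0 - beta1) * grad)  # bounded in [-1, 1]
    x_new = x - lr * (update + wd * x)
    m_new = beta2 * m + (1.0 - beta2) * grad            # momentum tracking
    return x_new, m_new
```

Because the sign operation is bounded, a single extreme gradient coordinate can move the iterate by at most the learning rate, which is the intuition behind robustness to heavy tails; Muon can be viewed as a matrix analogue that approximately orthogonalizes the momentum matrix (e.g., via Newton-Schulz iterations), acting as a matrix-valued sign.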