Stochastic Frank-Wolfe is a classical optimization method for solving constrained optimization problems. Recently, optimizers such as Lion and Muon have gained significant popularity in deep learning. In this work, building on recent initiatives, we provide a unifying perspective that interprets these seemingly disparate methods through the lens of Stochastic Frank-Wolfe. Specifically, we show that Lion and Muon with weight decay can be viewed as special instances of Stochastic Frank-Wolfe, and we establish their convergence guarantees in terms of the Frank-Wolfe gap, a standard stationarity measure for Frank-Wolfe methods in non-convex optimization. We further show that convergence in this gap implies, for Lion and Muon, convergence to a KKT point of the original problem under a norm constraint. Moreover, motivated by recent empirical findings that stochastic gradients in modern machine learning tasks often exhibit heavy-tailed distributions, we extend Stochastic Frank-Wolfe to settings with heavy-tailed noise by developing two robust variants with strong theoretical guarantees that hold for general compact convex sets without requiring large batch sizes, filling a gap in the literature on Stochastic Frank-Wolfe for non-convex optimization. The contributions in the latter part of this work, in turn, yield new variants of Lion and Muon that better accommodate heavy-tailed gradient noise, thereby broadening their practical scope.
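To make the claimed correspondence concrete, the following is a minimal illustrative sketch (not the paper's exact algorithm; the radius `r`, step size `eta`, and momentum `beta` are assumed hyperparameters): one momentum Stochastic Frank-Wolfe step over an ℓ∞-norm ball, whose linear minimization oracle returns a sign direction, so the update takes a Lion-like sign-of-momentum form with multiplicative weight decay.

```python
import numpy as np

def lmo_linf(m, r):
    """LMO over the l_inf ball: argmin over {s : ||s||_inf <= r} of <m, s>,
    which is attained at -r * sign(m)."""
    return -r * np.sign(m)

def sfw_step(w, m, grad, r=1.0, eta=0.1, beta=0.9):
    """One momentum Stochastic Frank-Wolfe step on the l_inf ball of radius r."""
    m = beta * m + (1 - beta) * grad   # momentum estimate of the gradient
    s = lmo_linf(m, r)                 # LMO direction: a scaled sign step
    # Convex combination keeps the iterate feasible; the (1 - eta) * w term
    # acts as multiplicative weight decay, the eta * s term as a sign update.
    w = (1 - eta) * w + eta * s
    return w, m
```

The point of the sketch is that the Frank-Wolfe convex-combination update over a norm ball simultaneously produces the sign-based step and the weight-decay term, which is the structural identification the abstract describes for Lion (and, with a spectral-norm ball and an orthogonalizing LMO, for Muon).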