The growing adoption of spectrum-aware matrix-valued optimizers such as Muon and Shampoo in deep learning motivates a systematic study of their generalization properties and, in particular, of when they might outperform competing algorithms. We approach this question through a sequence of simplifying abstractions: First, we use imbalanced data as a testbed. Second, we study the canonical form of such optimizers, Spectral Gradient Descent (SpecGD) -- each update step is $UV^T$, where $U\Sigma V^T$ is the truncated SVD of the gradient. Third, within this framework we identify a canonical setting in which we precisely quantify when SpecGD outperforms vanilla Euclidean GD. For a Gaussian mixture data model and both linear and bilinear models, we show that, unlike GD, which prioritizes learning the dominant principal components of the data first, SpecGD learns all principal components at equal rates. We demonstrate how this translates into a growing gap in balanced accuracy favoring SpecGD early in training, and further show that the gap persists even when the GD counterpart uses adaptive step-sizes via normalization. By extending the analysis to deep linear models, we show that depth amplifies these effects. We empirically verify our theoretical findings on a variety of imbalanced datasets. Our experiments compare practical variants of spectral methods, such as Muon and Shampoo, against their Euclidean counterparts and Adam. The results validate our finding that these spectral optimizers achieve superior generalization by promoting a more balanced learning of the data's underlying components.
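The canonical SpecGD update described above can be sketched in a few lines. The following is a minimal illustrative implementation in NumPy, not the paper's actual code: the function name `specgd_step` and the learning-rate parameter are placeholders, and for simplicity the full (rather than truncated) thin SVD of the gradient is used.

```python
import numpy as np

def specgd_step(W, grad, lr=0.1):
    """One SpecGD update: replace the raw gradient by U V^T, where
    U Sigma V^T is the SVD of the gradient. This whitens the spectrum,
    so every singular direction of the gradient is followed at the
    same rate -- the mechanism behind balanced component learning."""
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    return W - lr * (U @ Vt)

def gd_step(W, grad, lr=0.1):
    """Vanilla Euclidean GD, for comparison: the step is dominated by
    the gradient's top singular directions."""
    return W - lr * grad
```

Because $UV^T$ has all singular values equal to one, the SpecGD step moves equally along every principal direction of the gradient, whereas the GD step's movement along each direction is scaled by the corresponding singular value.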