PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective

The ever-growing scale of deep learning models and training data underscores the critical importance of efficient optimization methods. While preconditioned gradient methods such as Adam and AdamW are the de facto optimizers for training neural networks and large language models, structure-aware preconditioned optimizers like Shampoo and Muon, which utilize the matrix structure of gradients, have demonstrated promising evidence of faster convergence. In this paper, we introduce a unifying framework for analyzing "matrix-aware" preconditioned methods, which not only sheds light on the effectiveness of Muon and related optimizers but also leads to a class of new structure-aware preconditioned methods. A key contribution of this framework is its precise distinction between preconditioning strategies that treat neural network weights as vectors (addressing curvature anisotropy) versus those that consider their matrix structure (addressing gradient anisotropy). This perspective provides new insights into several empirical phenomena in language model pre-training, including Adam's training instabilities, Muon's accelerated convergence, and the necessity of learning rate warmup for Adam. Building upon this framework, we introduce PolarGrad, a new class of preconditioned optimization methods based on the polar decomposition of matrix-valued gradients. As a special instance, PolarGrad includes Muon with updates scaled by the nuclear norm of the gradients. We provide numerical implementations of these methods, leveraging efficient numerical polar decomposition algorithms for enhanced convergence. Our extensive evaluations across diverse matrix optimization problems and language model pre-training tasks demonstrate that PolarGrad outperforms both Adam and Muon.
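
To make the central idea concrete, the following is a minimal sketch of a polar-decomposition-based update as the abstract describes it: the matrix gradient G is replaced by its orthogonal polar factor, and the step is scaled by the nuclear norm of G. It is an illustration under stated assumptions, not the authors' implementation. The names `polar_factor` and `polargrad_step` are hypothetical, momentum and weight decay are omitted, and a thin SVD stands in for the efficient numerical polar decomposition algorithms the abstract refers to.

```python
import numpy as np


def polar_factor(G):
    """Return the orthogonal polar factor of G and the nuclear norm of G.

    With the thin SVD G = U diag(s) V^T, the polar decomposition is
    G = U_p H, where U_p = U V^T and H = V diag(s) V^T.
    The nuclear norm of G is the sum of its singular values.
    Assumption: an SVD is used here only for clarity; it is not one of the
    efficient iterative polar solvers mentioned in the abstract.
    """
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt, s.sum()


def polargrad_step(W, G, lr):
    """One plain (momentum-free) step: W <- W - lr * ||G||_nuc * U_p.

    Assumption: this is the simplest variant consistent with the abstract;
    the name and signature are illustrative only.
    """
    U_p, nuc = polar_factor(G)
    return W - lr * nuc * U_p


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T = rng.standard_normal((5, 3))   # target matrix
    W = rng.standard_normal((5, 3))   # current weights

    # Toy objective f(W) = 0.5 * ||W - T||_F^2, whose gradient is W - T.
    loss = lambda M: 0.5 * np.linalg.norm(M - T) ** 2
    W_new = polargrad_step(W, W - T, lr=0.1)
    print("loss before:", loss(W))
    print("loss after: ", loss(W_new))
```

Dropping the nuclear-norm factor leaves an update that uses only the orthogonal factor, which is the sense in which the abstract relates PolarGrad to Muon. On this toy quadratic the step reduces the loss whenever `lr` is below 2 divided by the smaller matrix dimension, which is why `lr=0.1` is a safe choice for a 5-by-3 matrix.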
