There is growing evidence that Shampoo, a higher-order preconditioning method, outperforms Adam on deep learning optimization tasks. However, Shampoo's drawbacks include additional hyperparameters and computational overhead compared to Adam, which only updates running averages of first- and second-moment quantities. This work establishes a formal connection between Shampoo (implemented with the 1/2 power) and Adafactor -- a memory-efficient approximation of Adam -- showing that Shampoo is equivalent to running Adafactor in the eigenbasis of Shampoo's preconditioner. This insight leads to the design of a simpler and more computationally efficient algorithm: $\textbf{S}$hampo$\textbf{O}$ with $\textbf{A}$dam in the $\textbf{P}$reconditioner's eigenbasis (SOAP). To improve Shampoo's computational efficiency, the most straightforward approach would be to compute Shampoo's eigendecomposition less frequently. Unfortunately, as our empirical results show, this leads to performance degradation that worsens as the eigendecomposition becomes less frequent. SOAP mitigates this degradation by continually updating the running average of the second moment, just as Adam does, but in the current (slowly changing) coordinate basis. Furthermore, since SOAP is equivalent to running Adam in a rotated space, it introduces only one additional hyperparameter (the preconditioning frequency) compared to Adam. We empirically evaluate SOAP on language model pre-training with models of 360M and 660M parameters. In the large-batch regime, SOAP reduces the number of iterations by over 40% and wall-clock time by over 35% compared to AdamW, with approximately 20% improvements in both metrics compared to Shampoo. An implementation of SOAP is available at https://github.com/nikhilvyas/SOAP.
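The core idea described above -- maintaining Shampoo's factored second-moment statistics, periodically refreshing their eigenbases, and running Adam's moment updates in the rotated coordinates -- can be sketched as follows. This is a minimal single-matrix illustration with numpy, not the paper's reference implementation; the function and state names (`soap_step`, `init_state`, `precond_freq`, etc.) are illustrative, and details such as bias correction and hyperparameter defaults follow standard Adam conventions rather than the released code.

```python
import numpy as np

def init_state(shape):
    """Illustrative optimizer state for one 2D parameter of the given shape."""
    n, m = shape
    return {"t": 0,
            "L": np.zeros((n, n)), "R": np.zeros((m, m)),   # Shampoo factors
            "QL": np.eye(n), "QR": np.eye(m),               # current eigenbases
            "m": np.zeros(shape), "v": np.zeros(shape)}     # Adam moments (rotated)

def soap_step(W, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
              shampoo_beta=0.95, eps=1e-8, precond_freq=10):
    """One SOAP-style update: Adam run in the (slowly refreshed)
    eigenbasis of Shampoo's preconditioner factors."""
    # Running averages of Shampoo's left/right factors.
    state["L"] = shampoo_beta * state["L"] + (1 - shampoo_beta) * grad @ grad.T
    state["R"] = shampoo_beta * state["R"] + (1 - shampoo_beta) * grad.T @ grad
    # Refresh the eigenbasis only every `precond_freq` steps -- the one
    # extra hyperparameter relative to Adam.
    if state["t"] % precond_freq == 0:
        _, state["QL"] = np.linalg.eigh(state["L"])
        _, state["QR"] = np.linalg.eigh(state["R"])
    state["t"] += 1
    # Rotate the gradient into the current (slowly changing) basis.
    g_rot = state["QL"].T @ grad @ state["QR"]
    # Standard Adam moments, maintained in the rotated coordinates so the
    # second-moment estimate keeps updating between basis refreshes.
    state["m"] = beta1 * state["m"] + (1 - beta1) * g_rot
    state["v"] = beta2 * state["v"] + (1 - beta2) * g_rot ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    update_rot = m_hat / (np.sqrt(v_hat) + eps)
    # Rotate the update back to the original basis and apply it.
    return W - lr * state["QL"] @ update_rot @ state["QR"].T
```

Between eigendecompositions the bases `QL`/`QR` stay fixed, so each step costs little more than Adam plus four small matrix multiplications; the expensive `eigh` calls are amortized over `precond_freq` steps.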