There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning method, over Adam in deep learning optimization tasks. However, Shampoo's drawbacks include additional hyperparameters and computational overhead compared with Adam, which only updates running averages of first- and second-moment quantities. This work establishes a formal connection between Shampoo (implemented with the 1/2 power) and Adafactor, a memory-efficient approximation of Adam, showing that Shampoo is equivalent to running Adafactor in the eigenbasis of Shampoo's preconditioner. This insight leads to the design of a simpler and computationally efficient algorithm: $\textbf{S}$hampo$\textbf{O}$ with $\textbf{A}$dam in the $\textbf{P}$reconditioner's eigenbasis (SOAP). To improve Shampoo's computational efficiency, the most straightforward approach would be to compute its eigendecomposition less frequently. Unfortunately, as our empirical results show, this leads to performance degradation that worsens as the eigendecomposition is computed less often. SOAP mitigates this degradation by continually updating the running average of the second moment, just as Adam does, but in the current (slowly changing) coordinate basis. Furthermore, since SOAP is equivalent to running Adam in a rotated space, it introduces only one additional hyperparameter (the preconditioning frequency) compared to Adam. We empirically evaluate SOAP on language model pre-training with models of 360M and 660M parameters. In the large-batch regime, SOAP reduces the number of iterations by over 40% and wall-clock time by over 35% compared to AdamW, with approximately 20% improvements in both metrics compared to Shampoo. An implementation of SOAP is available at https://github.com/nikhilvyas/SOAP.
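To make the idea of "Adam in the preconditioner's eigenbasis" concrete, the following is a minimal NumPy sketch of one update for a single weight matrix. It is an illustration under stated assumptions, not the implementation in the linked repository: the names `SoapState`, `soap_step`, and `precond_freq`, the default hyperparameter values, the use of Adam-style bias correction, and the exact ordering of the operations are all hypothetical choices made here. The sketch keeps Shampoo-style running averages of $GG^\top$ and $G^\top G$, recomputes their eigenvectors only every `precond_freq` steps, and updates the Adam second moment at every step in the current rotated coordinates.

```python
import numpy as np


class SoapState:
    """Per-matrix optimizer state for this sketch (all names are hypothetical)."""

    def __init__(self, shape):
        m, n = shape
        self.step = 0
        self.L = np.zeros((m, m))  # running average of G @ G.T
        self.R = np.zeros((n, n))  # running average of G.T @ G
        self.QL = np.eye(m)        # eigenvectors of L, refreshed occasionally
        self.QR = np.eye(n)        # eigenvectors of R, refreshed occasionally
        self.M = np.zeros(shape)   # first moment, stored in original coordinates
        self.V = np.zeros(shape)   # second moment, stored in rotated coordinates


def soap_step(W, G, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              precond_freq=10):
    """Return updated weights after one Adam-in-eigenbasis step (sketch only)."""
    # Shampoo-style statistics are cheap to accumulate at every step.
    state.L = beta2 * state.L + (1 - beta2) * (G @ G.T)
    state.R = beta2 * state.R + (1 - beta2) * (G.T @ G)

    # The expensive eigendecomposition runs only every `precond_freq` steps.
    # V is deliberately left untouched when the basis changes (an assumption).
    if state.step % precond_freq == 0:
        _, state.QL = np.linalg.eigh(state.L)
        _, state.QR = np.linalg.eigh(state.R)
    state.step += 1

    # Rotate the gradient into the current eigenbasis.
    G_rot = state.QL.T @ G @ state.QR

    # Adam moments: M in original coordinates, V in rotated coordinates.
    state.M = beta1 * state.M + (1 - beta1) * G
    state.V = beta2 * state.V + (1 - beta2) * G_rot**2
    M_rot = state.QL.T @ state.M @ state.QR

    # Adam-style bias correction and elementwise normalization.
    m_hat = M_rot / (1 - beta1**state.step)
    v_hat = state.V / (1 - beta2**state.step)
    N_rot = m_hat / (np.sqrt(v_hat) + eps)

    # Rotate the normalized update back and apply it.
    return W - lr * (state.QL @ N_rot @ state.QR.T)


# Toy usage: minimize 0.5 * ||W - target||^2, whose gradient is W - target.
rng = np.random.default_rng(0)
target = rng.normal(size=(4, 3))
W = np.zeros((4, 3))
state = SoapState(W.shape)
initial_loss = 0.5 * np.sum((W - target) ** 2)
for _ in range(100):
    W = soap_step(W, W - target, state, lr=0.05)
final_loss = 0.5 * np.sum((W - target) ** 2)
```

In this sketch, `precond_freq` plays the role of the single extra hyperparameter the abstract mentions. Between eigendecompositions the basis is stale, but the second-moment estimate `V` still tracks every gradient, which is the mechanism the abstract credits for limiting the degradation seen when Shampoo's eigendecomposition is simply computed less often.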