This paper studies AdamW-style Shampoo, an effective variant of classical Shampoo that won the external tuning track of the AlgoPerf neural network training competition. Our analysis unifies one-sided and two-sided preconditioning. When the exponents of the two preconditioners sum to $1/2$, we establish the convergence rate $\frac{1}{K}\sum_{k=1}^K \mathbb{E}\left[\|\nabla f(X_k)\|_*\right]\leq O\left(\frac{\sqrt{m+n}\,C}{K^{1/4}}\right)$, where $K$ is the number of iterations, $(m,n)$ are the dimensions of the matrix-valued parameters, and $C$ matches the constant appearing in the optimal convergence rate of SGD. The nuclear norm and the Frobenius norm satisfy $\|\nabla f(X)\|_F\leq \|\nabla f(X)\|_*\leq \sqrt{\min\{m,n\}}\,\|\nabla f(X)\|_F$, which suggests that our convergence rate is analogous to the optimal rate $\frac{1}{K}\sum_{k=1}^K \mathbb{E}\left[\|\nabla f(X_k)\|_F\right]\leq O\left(\frac{C}{K^{1/4}}\right)$ of SGD in the ideal case where $\|\nabla f(X)\|_*= \Theta(\sqrt{\min\{m,n\}})\,\|\nabla f(X)\|_F$ and $m$ and $n$ are of comparable magnitude. We then extend the analysis to settings where the preconditioning exponents do not sum to $1/2$ and establish convergence with an explicit but more involved rate.
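The norm comparison above can be checked numerically. The following is a minimal sketch, not part of the paper's analysis: the helper name `norm_sandwich`, the matrix sizes, and the random test matrices are illustrative assumptions. It verifies the inequality between the Frobenius and nuclear norms on a random matrix, exhibits the two extreme cases (a rank-one matrix, where the two norms coincide, and a matrix with equal singular values, where the upper bound is attained), and computes the factor $\sqrt{m+n}/\sqrt{\min\{m,n\}}$ that remains when the nuclear-norm rate is compared with the Frobenius-norm rate in the ideal case.

```python
import numpy as np


def norm_sandwich(G):
    """Return (frobenius, nuclear, upper), where upper = sqrt(min(m, n)) * frobenius.

    Hypothetical helper for illustration only; both norms are computed
    from the singular values of G.
    """
    m, n = G.shape
    s = np.linalg.svd(G, compute_uv=False)   # singular values of G
    fro = float(np.sqrt(np.sum(s ** 2)))     # Frobenius norm: l2 norm of singular values
    nuc = float(np.sum(s))                   # nuclear norm: l1 norm of singular values
    upper = float(np.sqrt(min(m, n)) * fro)
    return fro, nuc, upper


rng = np.random.default_rng(0)
m, n = 64, 48  # illustrative sizes, assumed for this sketch

# Generic case: a random matrix standing in for a gradient.
fro, nuc, upper = norm_sandwich(rng.standard_normal((m, n)))

# Lower extreme: a rank-one matrix has a single nonzero singular value,
# so the nuclear and Frobenius norms coincide.
rank_one = rng.standard_normal((m, 1)) @ rng.standard_normal((1, n))
fro_r1, nuc_r1, _ = norm_sandwich(rank_one)

# Upper extreme (the "ideal case"): orthonormal columns give equal
# singular values, so the nuclear norm equals sqrt(min(m, n)) * Frobenius.
Q, _ = np.linalg.qr(rng.standard_normal((m, n)))  # reduced QR, Q is m x n
fro_q, nuc_q, upper_q = norm_sandwich(Q)

# Dimension factor left over in the ideal case; it stays bounded
# when m and n are of comparable magnitude (sqrt(2) when m == n).
dimension_factor = float(np.sqrt(m + n) / np.sqrt(min(m, n)))

print(f"random:    {fro:.3f} <= {nuc:.3f} <= {upper:.3f}")
print(f"rank one:  nuclear / Frobenius = {nuc_r1 / fro_r1:.6f}")
print(f"isometry:  nuclear / upper     = {nuc_q / upper_q:.6f}")
print(f"sqrt(m+n) / sqrt(min(m,n))     = {dimension_factor:.4f}")
```

In the ideal case, dividing the nuclear-norm bound by $\sqrt{\min\{m,n\}}$ leaves a Frobenius-norm bound of order $\sqrt{(m+n)/\min\{m,n\}}\,C/K^{1/4}$, which is why the comparison with SGD requires $m$ and $n$ to be of comparable magnitude.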