Shampoo is one of the leading approximate second-order optimizers: a variant of it won the MLCommons AlgoPerf competition, and it has been shown to produce models with fewer activation outliers, making them easier to compress. Yet applying Shampoo currently comes at the cost of a significant computational slowdown due to its expensive internal operations. In this paper, we take a significant step toward addressing this shortcoming by proposing \method (for \textbf{D}istributed \textbf{A}ccelerated \textbf{SH}ampoo), a faster implementation of Distributed Shampoo based on two new techniques. First, we show that preconditioner blocks can be stacked into 3D tensors to significantly improve GPU utilization. Second, we introduce the Newton-DB iteration and Chebyshev polynomial approximations as novel, faster approaches for computing the inverse matrix roots required by Shampoo. Alongside these algorithmic contributions, we provide the first in-depth analysis of how matrix scaling critically affects Shampoo convergence. On the practical side, our GPU-aware implementation achieves up to $4.83\times$ faster optimizer steps than the well-optimized Distributed Shampoo, while Newton-DB attains the lowest validation perplexity per iteration among all tested methods. Our code is available at https://github.com/IST-DASLab/DASH.
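To make the block-stacking idea concrete, here is a minimal NumPy sketch (not the paper's implementation): same-sized preconditioner blocks are stacked into one 3D tensor so that a single batched eigendecomposition computes all inverse matrix roots at once, rather than looping over blocks. The function name, shapes, and the eigendecomposition-based root (in place of the paper's Newton-DB or Chebyshev methods) are illustrative assumptions.

```python
import numpy as np

def batched_inverse_root(blocks, p=4, eps=1e-6):
    """Compute L^{-1/p} for a stack of symmetric PSD blocks in one shot.

    blocks : array of shape (B, n, n) -- same-sized preconditioner
             blocks stacked into a single 3D tensor, mirroring the
             stacking trick that improves GPU utilization.
    p      : root order (Shampoo typically needs inverse 2p-th roots).
    eps    : eigenvalue clamp for numerical stability (illustrative).
    """
    # One batched eigendecomposition over all B blocks.
    w, V = np.linalg.eigh(blocks)
    # Inverse p-th root applied to the (clamped) eigenvalues.
    w = np.maximum(w, eps) ** (-1.0 / p)
    # Reassemble V diag(w) V^T for every block at once.
    return np.einsum('bij,bj,bkj->bik', V, w, V)
```

On GPU, the analogous batched `eigh`/`matmul` kernels process the whole stack with one launch, which is the source of the utilization gain the abstract refers to.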