Shampoo is an online and stochastic optimization algorithm belonging to the AdaGrad family of methods for training neural networks. It constructs a block-diagonal preconditioner where each block consists of a coarse Kronecker product approximation to full-matrix AdaGrad for each parameter of the neural network. In this work, we provide a complete description of the algorithm as well as the performance optimizations that our implementation leverages to train deep networks at scale in PyTorch. Our implementation enables fast multi-GPU distributed data-parallel training by distributing the memory and computation associated with blocks of each parameter via PyTorch's DTensor data structure and performing an AllGather primitive on the computed search directions at each iteration. This major performance enhancement limits the slowdown in per-step wall-clock time to at most 10% compared with standard diagonal-scaling-based adaptive gradient methods. We validate our implementation by performing an ablation study on training ResNet50 on ImageNet, demonstrating Shampoo's superiority over standard training recipes with minimal hyperparameter tuning.
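To make the Kronecker-factored preconditioner concrete, the following is a minimal single-device sketch of a basic Shampoo-style update for one matrix-shaped parameter, written with NumPy. It is an illustration under stated assumptions, not the distributed PyTorch implementation described above: the function names `inverse_root` and `shampoo_step`, the learning rate, and the `eps` regularization are hypothetical, and blocking, momentum, DTensor sharding, and the AllGather step are all omitted.

```python
import numpy as np


def inverse_root(mat, root, eps=1e-12):
    """Return mat^(-1/root) for a symmetric positive semi-definite matrix.

    Uses an eigendecomposition; eps (an assumed regularizer) keeps the
    eigenvalues strictly positive before the negative power is taken.
    """
    eigvals, eigvecs = np.linalg.eigh(mat)
    eigvals = np.maximum(eigvals, 0.0) + eps
    return (eigvecs * eigvals ** (-1.0 / root)) @ eigvecs.T


def shampoo_step(W, G, L, R, lr=0.1, eps=1e-12):
    """One basic Shampoo-style step for an m x n parameter W with gradient G.

    L (m x m) and R (n x n) accumulate the left and right Kronecker factors,
    so the state costs m^2 + n^2 entries instead of the (m*n)^2 entries
    that full-matrix AdaGrad would need.
    """
    L = L + G @ G.T              # left factor statistics
    R = R + G.T @ G              # right factor statistics
    direction = inverse_root(L, 4, eps) @ G @ inverse_root(R, 4, eps)
    return W - lr * direction, L, R


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((4, 3))
    L = np.zeros((4, 4))
    R = np.zeros((3, 3))
    for _ in range(3):
        G = rng.standard_normal((4, 3))   # stand-in for a stochastic gradient
        W, L, R = shampoo_step(W, G, L, R)
    print(W.shape, L.shape, R.shape)
```

The search direction `L^{-1/4} G R^{-1/4}` is the quantity that a distributed implementation would compute per block on separate workers and then gather; here everything runs on a single process for clarity.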