Shampoo and its efficient variant, SOAP, employ structured second-moment estimation and have shown strong performance for training neural networks (NNs). In practice, however, Shampoo typically requires step-size grafting with Adam to be competitive, and SOAP mitigates this by applying Adam in Shampoo's eigenbasis. Both methods therefore incur additional memory overhead from Adam. Prior analyses have largely relied on the Frobenius norm to motivate these estimation schemes. We instead recast their estimation procedures as covariance estimation under Kullback-Leibler (KL) divergence minimization, revealing a previously overlooked theoretical limitation and motivating principled redesigns. Building on this perspective, we develop $\textbf{KL-Shampoo}$ and $\textbf{KL-SOAP}$, practical schemes that match or exceed the performance of Shampoo and SOAP in NN pre-training while achieving SOAP-level per-iteration runtime. Notably, KL-Shampoo does not rely on Adam to attain competitive performance, eliminating the memory overhead introduced by Adam. Across our experiments, KL-Shampoo consistently outperforms SOAP, Shampoo, and even KL-SOAP, establishing the KL-based approach as a promising foundation for designing structured methods in NN optimization. An implementation of KL-Shampoo/KL-SOAP is available at https://github.com/yorkerlin/KL-Methods
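As a rough illustration of the kind of quantity a KL-based covariance view works with, the sketch below evaluates the standard closed-form KL divergence between two zero-mean Gaussians, where one covariance is Kronecker-structured in the way Shampoo's preconditioner is. It is not taken from the released implementation: the function names, the matrix sizes, and the argument order of the divergence are assumptions chosen for illustration only.

```python
import numpy as np


def gaussian_kl(cov_p, cov_q):
    """KL( N(0, cov_p) || N(0, cov_q) ) for symmetric positive-definite inputs.

    Hypothetical helper; the argument order is an assumption, not necessarily
    the direction of the divergence used by KL-Shampoo.
    """
    d = cov_p.shape[0]
    # trace(cov_q^{-1} cov_p), computed with a linear solve instead of an inverse
    trace_term = np.trace(np.linalg.solve(cov_q, cov_p))
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (trace_term - d + logdet_q - logdet_p)


def random_spd(n, rng):
    """Random symmetric positive-definite matrix (illustrative data only)."""
    a = rng.standard_normal((n, n))
    return a @ a.T + n * np.eye(n)


rng = np.random.default_rng(0)
left = random_spd(3, rng)            # hypothetical 3x3 Kronecker factor
right = random_spd(2, rng)           # hypothetical 2x2 Kronecker factor
structured = np.kron(left, right)    # 6x6 Kronecker-structured covariance
target = random_spd(6, rng)          # stand-in for an unstructured second moment

kl_same = gaussian_kl(target, target)
kl_structured = gaussian_kl(target, structured)
```

Under these assumptions, the divergence is zero when the two covariances coincide and positive otherwise, so it can serve as a measure of how well a structured covariance fits an unstructured one.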