Shampoo and its efficient variant, SOAP, employ structured second-moment estimation and have shown strong performance for training neural networks (NNs). In practice, however, Shampoo typically requires step-size grafting with Adam to be competitive, and SOAP mitigates this by applying Adam in Shampoo's eigenbasis, so both methods still incur the additional memory overhead of maintaining Adam's state. Prior analyses have largely relied on the Frobenius norm to motivate these estimation schemes. We instead recast their estimation procedures as covariance estimation under Kullback-Leibler (KL) divergence minimization, revealing a previously overlooked theoretical limitation and motivating principled redesigns. Building on this perspective, we develop $\textbf{KL-Shampoo}$ and $\textbf{KL-SOAP}$, practical schemes that match or exceed the performance of Shampoo and SOAP in NN pre-training while achieving SOAP-level per-iteration runtime. Notably, KL-Shampoo does not rely on Adam to attain competitive performance, eliminating the memory overhead that Adam introduces. Across our experiments, KL-Shampoo consistently outperforms SOAP, Shampoo, and even KL-SOAP, establishing the KL-based approach as a promising foundation for designing structured methods in NN optimization. An implementation of KL-Shampoo/KL-SOAP is available at https://github.com/yorkerlin/KL-Methods