Adaptive regularization methods that exploit more than the diagonal entries exhibit state-of-the-art performance for many tasks, but can be prohibitively expensive in memory and running time. We find that the spectra of the Kronecker-factored gradient covariance matrices in deep learning (DL) training tasks are concentrated on a small leading eigenspace that changes throughout training, which motivates a low-rank sketching approach. We describe a generic method for reducing the memory and compute requirements of maintaining a matrix preconditioner using the Frequent Directions (FD) sketch. While previous approaches have explored applying FD to second-order optimization, we present a novel analysis that allows efficient interpolation between resource requirements and the degradation in regret guarantees with rank $k$: in the online convex optimization (OCO) setting over dimension $d$, we match the regret of full-matrix methods, which need $d^2$ memory, using only $dk$ memory, up to an additive error in the bottom $d-k$ eigenvalues of the gradient covariance. Further, we show extensions of our work to Shampoo, resulting in a method competitive in quality with Shampoo and Adam, yet requiring only sub-linear memory for tracking second moments.
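
For readers unfamiliar with Frequent Directions, the following is a minimal sketch of the textbook FD update applied to a stream of gradients, written with NumPy. It is an illustration under stated assumptions, not the implementation from this work: the names `fd_update`, `sketch_stream`, `ell`, and `rho` are hypothetical, the sketch size `ell` is assumed to be at most the dimension `d`, and the variant shown shrinks by the smallest squared singular value at every step.

```python
import numpy as np


def fd_update(sketch, g):
    """One textbook Frequent Directions step (illustrative, hypothetical helper).

    sketch: (ell, d) array whose last row is zero, with ell <= d.
    g:      (d,) gradient vector to absorb into the sketch.
    Returns the updated (ell, d) sketch and the squared singular value
    that was subtracted in this step.
    """
    ell, d = sketch.shape
    assert ell <= d, "this sketch assumes ell <= d"
    stacked = sketch.copy()
    stacked[-1] = g  # the last row is free, so the new gradient goes there
    _, s, vt = np.linalg.svd(stacked, full_matrices=False)
    delta = s[-1] ** 2  # smallest squared singular value
    # Shrink every direction by delta; the last row becomes exactly zero again.
    shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
    return shrunk[:, None] * vt, delta


def sketch_stream(grads, ell):
    """Run FD over the rows of grads, using ell * d memory instead of d * d."""
    d = grads.shape[1]
    sketch = np.zeros((ell, d))
    rho = 0.0  # total mass removed by shrinking
    for g in grads:
        sketch, delta = fd_update(sketch, g)
        rho += delta
    return sketch, rho
```

With `A` denoting the matrix of stacked gradients and `B` the returned sketch, this variant keeps `A.T @ A - B.T @ B` positive semidefinite with spectral norm at most `rho`, so the sketch error is small whenever the gradient covariance has little mass outside its leading eigenspace.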