Leveraging second-order information at the scale of deep networks is one of the main lines of approach for improving the performance of current optimizers for deep learning. Yet, existing approaches for accurate full-matrix preconditioning, such as Full-Matrix Adagrad (GGT) or Matrix-Free Approximate Curvature (M-FAC), suffer from massive storage costs when applied even to medium-scale models, as they must store a sliding window of gradients, whose memory requirements are multiplicative in the model dimension. In this paper, we address this issue via an efficient and simple-to-implement error-feedback technique that can be applied to compress preconditioners by up to two orders of magnitude in practice, without loss of convergence. Specifically, our approach compresses the gradient information via sparsification or low-rank compression \emph{before} it is fed into the preconditioner, feeding the compression error back into future iterations. Extensive experiments on deep neural networks for vision show that this approach can compress full-matrix preconditioners by up to two orders of magnitude without impact on accuracy, effectively removing the memory overhead of full-matrix preconditioning for implementations of full-matrix Adagrad (GGT) and natural gradient (M-FAC). Our code is available at https://github.com/IST-DASLab/EFCP.
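The error-feedback step described above can be illustrated with a minimal sketch. It assumes Top-k sparsification as the compressor and is not taken from the EFCP repository; the names `top_k` and `ErrorFeedbackCompressor` are hypothetical, and the preconditioner that would consume the compressed gradients (for example, a sliding window as in GGT or M-FAC) is left out.

```python
import numpy as np


def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]  # indices of the k largest |v_i|
    out[idx] = v[idx]
    return out


class ErrorFeedbackCompressor:
    """Hypothetical helper: compress gradients with error feedback.

    Whatever Top-k drops at one step is kept in an error buffer and
    added to the gradient at the next step.
    """

    def __init__(self, dim, k):
        self.k = k
        self.error = np.zeros(dim)  # accumulated compression error

    def compress(self, grad):
        acc = self.error + grad        # add back the previously dropped mass
        sparse = top_k(acc, self.k)    # this is what the preconditioner would store
        self.error = acc - sparse      # keep the residual for future iterations
        return sparse
```

In this sketch, only the sparse vectors returned by `compress` would enter the preconditioner's gradient window, so each stored gradient needs `k` values and indices instead of a dense vector. No gradient mass is permanently discarded: the sum of all compressed outputs plus the current error buffer equals the sum of the raw gradients.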