Leveraging second-order information at the scale of deep networks is one of the main approaches for improving the performance of current deep learning optimizers. Yet existing approaches for accurate full-matrix preconditioning, such as Full-Matrix Adagrad (GGT) or Matrix-Free Approximate Curvature (M-FAC), suffer from massive storage costs even when applied to medium-scale models, because they must store a sliding window of gradients whose memory requirements are multiplicative in the model dimension (window size times number of parameters). In this paper, we address this issue via an efficient and simple-to-implement error-feedback technique that can compress preconditioners by up to two orders of magnitude in practice, without loss of convergence. Specifically, our approach compresses the gradient information via sparsification or low-rank compression before it is fed into the preconditioner, and feeds the compression error back into future iterations. Extensive experiments on deep neural networks for vision show that this approach can compress full-matrix preconditioners by up to two orders of magnitude without impact on accuracy, effectively removing the memory overhead of full-matrix preconditioning for implementations of full-matrix Adagrad (GGT) and natural gradient (M-FAC). Our code is available at https://github.com/IST-DASLab/EFCP.
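To make the compress-then-feed-back step concrete, the following is a minimal sketch of error feedback with top-k sparsification, written in plain Python. It is an illustration under stated assumptions and not the EFCP implementation: the names `top_k_sparsify` and `ErrorFeedbackCompressor` are hypothetical, the preconditioner itself is omitted, and top-k is just one possible choice of sparsifier.

```python
def top_k_sparsify(vec, k):
    """Keep the k largest-magnitude entries of vec and zero the rest."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


class ErrorFeedbackCompressor:
    """Top-k compression with an error accumulator (hypothetical helper)."""

    def __init__(self, dim, k):
        self.k = k
        self.error = [0.0] * dim  # compression error carried between steps

    def compress(self, grad):
        # Add the error left over from earlier steps to the fresh gradient.
        acc = [g + e for g, e in zip(grad, self.error)]
        # Compress the accumulated vector; only this sparse vector would be
        # stored in the preconditioner's sliding window of gradients.
        sparse = top_k_sparsify(acc, self.k)
        # Whatever was dropped is fed back into future iterations.
        self.error = [a - s for a, s in zip(acc, sparse)]
        return sparse


comp = ErrorFeedbackCompressor(dim=4, k=1)
first = comp.compress([0.5, -2.0, 0.3, 0.1])   # keeps only the -2.0 entry
second = comp.compress([0.5, 0.1, 0.3, 0.1])   # first entry has accumulated to 1.0
```

In this toy run the first coordinate is dropped in step one, but its error is carried over, so it becomes the largest accumulated entry and is passed on in step two. With `k` much smaller than `dim`, each stored gradient needs only `k` values and indices, which is the kind of saving the abstract refers to for the sliding window.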