Leveraging second-order information at the scale of deep networks is one of the main approaches to improving the performance of current optimizers for deep learning. Yet existing approaches for accurate full-matrix preconditioning, such as Full-Matrix Adagrad (GGT) or Matrix-Free Approximate Curvature (M-FAC), suffer from massive storage costs even when applied to medium-scale models, because they must store a sliding window of gradients whose memory requirement is the window size multiplied by the model dimension. In this paper, we address this issue via an efficient and simple-to-implement error-feedback technique that can compress preconditioners by up to two orders of magnitude in practice without degrading convergence. Specifically, our approach compresses the gradient information via sparsification or low-rank compression before it is fed into the preconditioner, and feeds the compression error back into future iterations. Extensive experiments on deep neural networks for vision confirm that this level of compression has no impact on accuracy, effectively removing the memory overhead of full-matrix preconditioning for implementations of full-matrix Adagrad (GGT) and natural gradient (M-FAC). Our code is available at https://github.com/IST-DASLab/EFCP.
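The compression step described in the abstract can be illustrated with a minimal sketch. It is not the code from the linked repository: the class `ErrorFeedbackCompressor`, the helper `topk_sparsify`, and the toy sizes (`dim`, `k`, `window`) are hypothetical names and values chosen for illustration, and top-k sparsification stands in for whichever compressor is actually used. The sketch only shows how the dropped part of each gradient is carried into the next iteration, and how a sliding window then stores `k` values per gradient instead of `dim`. The preconditioner itself (GGT or M-FAC) is not implemented here.

```python
from collections import deque

import numpy as np


def topk_sparsify(vec, k):
    """Return (indices, values) of the k largest-magnitude entries of vec."""
    idx = np.argpartition(np.abs(vec), -k)[-k:]
    return idx, vec[idx]


class ErrorFeedbackCompressor:
    """Top-k compression with an error accumulator (illustrative helper)."""

    def __init__(self, dim, k):
        self.k = k
        self.error = np.zeros(dim)  # compression error carried across steps

    def compress(self, grad):
        acc = self.error + grad          # add back what earlier steps dropped
        idx, vals = topk_sparsify(acc, self.k)
        self.error = acc.copy()
        self.error[idx] = 0.0            # residual = accumulator - compressed
        return idx, vals


# Toy run: random vectors stand in for gradients (assumed sizes).
rng = np.random.default_rng(0)
dim, k, window = 1000, 10, 5             # k / dim = 1% density
compressor = ErrorFeedbackCompressor(dim, k)
history = deque(maxlen=window)           # sliding window of sparse gradients

total_grad = np.zeros(dim)               # sum of all raw gradients
total_sent = np.zeros(dim)               # sum of all compressed gradients
for _ in range(20):
    grad = rng.standard_normal(dim)
    idx, vals = compressor.compress(grad)
    history.append((idx, vals))          # what a preconditioner would consume
    total_grad += grad
    total_sent[idx] += vals
```

In this sketch, the sum of the raw gradients always equals the sum of the compressed gradients plus the current error buffer, so no gradient mass is permanently discarded, only delayed. With `k` set to 1% of `dim`, each window entry holds roughly one hundredth of the values of a dense gradient, plus the integer indices needed to locate them.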