Second-order methods can converge much faster than first-order methods by incorporating second-order derivatives or statistics, but they are far less prevalent in deep learning because of their computational cost. Many existing solutions address this by reducing the size of the matrix to be inverted; however, the inverse must still be computed in every iteration. In this paper, we present a fast natural gradient descent (FNGD) method, which only requires computing the inverse during the first epoch. First, we use the Sherman-Morrison-Woodbury formula to reformulate the gradient preconditioning step of natural gradient descent (NGD) as a weighted sum of per-sample gradients. Building on this, to avoid the repeated inversion involved in computing the weighting coefficients, we share the coefficients across epochs, which does not affect empirical performance. FNGD thus approximates NGD as a fixed-coefficient weighted sum, akin to the uniform average of per-sample gradients used in first-order methods. Consequently, the computational complexity of FNGD can approach that of first-order methods. To demonstrate the efficiency of FNGD, we perform empirical evaluations on image classification and machine translation tasks. For training ResNet-18 on the CIFAR-100 dataset, FNGD achieves a speedup of 2.05$\times$ compared with KFAC. For training Transformer on Multi30K, FNGD outperforms AdamW by 24 BLEU points while requiring almost the same training time.
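The reformulation described above can be illustrated with a small numerical sketch. It is not the authors' implementation: it assumes a damped empirical Fisher matrix $F = \frac{1}{N} U U^\top + \lambda I$ built from a matrix $U$ whose columns are flattened per-sample gradients, whereas the paper's exact formulation (for example, any per-layer treatment) may differ. Under that assumption, the Sherman-Morrison-Woodbury formula gives $F^{-1} \bar{g} = U c$ with $c = (U^\top U + N \lambda I)^{-1} \mathbf{1}$, so only an $N \times N$ system has to be solved. The function names, the damping value, and the random data are hypothetical.

```python
import numpy as np


def ngd_direct(U, damping):
    """Reference: solve with the full d x d damped empirical Fisher (assumed form)."""
    d, n = U.shape
    mean_grad = U.mean(axis=1)
    fisher = U @ U.T / n + damping * np.eye(d)
    return np.linalg.solve(fisher, mean_grad)


def smw_coefficients(U, damping):
    """Per-sample weights c such that F^{-1} g_mean = U @ c.

    Follows from Sherman-Morrison-Woodbury under the assumed Fisher form;
    only an n x n linear system is solved.
    """
    n = U.shape[1]
    gram = U.T @ U  # n x n Gram matrix of per-sample gradients
    return np.linalg.solve(gram + n * damping * np.eye(n), np.ones(n))


def weighted_sum(U, coeffs):
    """Combine per-sample gradients with the given coefficients."""
    return U @ coeffs


rng = np.random.default_rng(0)
U = rng.standard_normal((50, 8))  # 50 parameters, 8 samples (toy data)
coeffs = smw_coefficients(U, damping=0.1)

# On the batch used to compute them, the coefficients reproduce NGD.
assert np.allclose(ngd_direct(U, 0.1), weighted_sum(U, coeffs))

# Reusing the same coefficients on a later batch needs no further solve;
# this is an approximation of NGD, in the spirit of coefficient sharing.
U_next = rng.standard_normal((50, 8))
approx_update = weighted_sum(U_next, coeffs)
```

Setting every coefficient to $1/N$ in `weighted_sum` recovers the plain mean gradient, which is the sense in which a fixed-coefficient weighted sum costs about as much as a first-order update.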