As a second-order method, Natural Gradient Descent (NGD) can accelerate the training of neural networks. However, because computing and inverting the Fisher Information Matrix (FIM) is prohibitively expensive in both computation and memory, efficient approximations are necessary to make NGD scalable to Deep Neural Networks (DNNs). Many such approximations have been attempted. The most sophisticated of these is KFAC, which approximates the FIM as a block-diagonal matrix in which each block corresponds to a layer of the neural network. In doing so, KFAC ignores the interactions between different layers. In this work, we investigate whether it is worthwhile to restore some low-frequency interactions between the layers by means of two-level methods. Inspired by domain decomposition, we propose and assess several two-level corrections to KFAC that use different coarse spaces. The results show that incorporating the layer interactions in this fashion does not meaningfully improve the performance of KFAC. This suggests that it is safe to discard the off-diagonal blocks of the FIM, since the block-diagonal approach is sufficiently robust, accurate, and economical in computation time.
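To make the block-diagonal idea and a two-level correction concrete, the following is a minimal NumPy sketch, not the implementation evaluated in this work. It assumes a small dense stand-in for the FIM, omits the Kronecker factorization that KFAC applies within each block, and uses a hypothetical coarse space with one normalized indicator vector per layer combined additively with the block solve. The names `block_diag_solve`, `coarse_restriction`, and `two_level_solve` are illustrative only.

```python
import numpy as np

def block_diag_solve(F, sizes, g):
    """Apply the inverse of the block-diagonal part of F to g, one layer at a time."""
    out = np.empty_like(g)
    start = 0
    for n in sizes:
        sl = slice(start, start + n)
        out[sl] = np.linalg.solve(F[sl, sl], g[sl])  # off-diagonal blocks are ignored
        start += n
    return out

def coarse_restriction(sizes):
    """Hypothetical coarse space: one normalized indicator row per layer."""
    R0 = np.zeros((len(sizes), sum(sizes)))
    start = 0
    for i, n in enumerate(sizes):
        R0[i, start:start + n] = 1.0 / np.sqrt(n)
        start += n
    return R0

def two_level_solve(F, sizes, g):
    """Block-diagonal solve plus an additive coarse correction R0^T (R0 F R0^T)^{-1} R0 g."""
    R0 = coarse_restriction(sizes)
    coarse = R0.T @ np.linalg.solve(R0 @ F @ R0.T, R0 @ g)
    return block_diag_solve(F, sizes, g) + coarse

# Toy problem: three "layers" with 4, 3 and 2 parameters (assumed sizes).
rng = np.random.default_rng(0)
sizes = [4, 3, 2]
n = sum(sizes)
J = rng.standard_normal((50, n))          # stand-in for per-sample gradients
F = J.T @ J / 50 + 1e-2 * np.eye(n)       # damped empirical Fisher-like matrix
g = rng.standard_normal(n)                # stand-in for the loss gradient

d_block = block_diag_solve(F, sizes, g)   # block-diagonal (layer-wise) direction
d_two = two_level_solve(F, sizes, g)      # direction with coarse layer coupling
d_exact = np.linalg.solve(F, g)           # full natural-gradient direction, for reference
```

Comparing `d_block` and `d_two` against `d_exact` on such a toy matrix shows how the coarse term reintroduces a small amount of cross-layer coupling; whether that helps in practice is exactly the question the work investigates.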