As a second-order method, Natural Gradient Descent (NGD) can accelerate the training of neural networks. However, computing and inverting the Fisher Information Matrix (FIM) is prohibitively expensive in both computation and memory, so efficient approximations are needed to make NGD scalable to Deep Neural Networks (DNNs). Many such approximations have been attempted. The most sophisticated of these is KFAC, which approximates the FIM as a block-diagonal matrix in which each block corresponds to one layer of the neural network. In doing so, KFAC ignores the interactions between different layers. In this work, we investigate whether it is worthwhile to restore some low-frequency interactions between the layers by means of two-level methods. Inspired by domain decomposition, we propose and assess several two-level corrections to KFAC that use different coarse spaces. The results show that incorporating the layer interactions in this fashion does not noticeably improve the performance of KFAC. This suggests that it is safe to discard the off-diagonal blocks of the FIM, since the block-diagonal approach is sufficiently robust, accurate and economical in computation time.
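To make the contrast between the block-diagonal approximation and a two-level correction concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a small dense FIM `F`, exact (rather than Kronecker-factored) diagonal blocks, a damping term, and a simple coarse space with one normalized indicator vector per layer in the style of additive two-level domain decomposition. The function names `block_diag_solve`, `coarse_restriction`, and `two_level_solve` are hypothetical, and the coarse spaces studied in the paper may differ.

```python
import numpy as np

def block_diag_solve(F, g, sizes, damping=1e-3):
    """Apply the inverse of the damped block-diagonal part of F to g.

    Each diagonal block stands for one layer; off-diagonal (inter-layer)
    blocks are ignored, as in a block-diagonal FIM approximation.
    """
    out = np.zeros_like(g)
    start = 0
    for n in sizes:
        sl = slice(start, start + n)
        block = F[sl, sl] + damping * np.eye(n)
        out[sl] = np.linalg.solve(block, g[sl])
        start += n
    return out

def coarse_restriction(sizes):
    """Assumed coarse space: one normalized indicator vector per layer."""
    R0 = np.zeros((len(sizes), sum(sizes)))
    start = 0
    for i, n in enumerate(sizes):
        R0[i, start:start + n] = 1.0 / np.sqrt(n)
        start += n
    return R0

def two_level_solve(F, g, sizes, damping=1e-3):
    """Block-diagonal solve plus an additive coarse correction.

    The coarse matrix R0 F R0^T is built from the full F, so it carries
    some inter-layer coupling that the block-diagonal part discards.
    """
    R0 = coarse_restriction(sizes)
    F_damped = F + damping * np.eye(F.shape[0])
    coarse = R0 @ F_damped @ R0.T
    correction = R0.T @ np.linalg.solve(coarse, R0 @ g)
    return block_diag_solve(F, g, sizes, damping) + correction

# Toy usage: three "layers" with 3, 4 and 2 parameters (illustrative sizes).
rng = np.random.default_rng(0)
sizes = [3, 4, 2]
A = rng.standard_normal((sum(sizes), 20))
F = A @ A.T / 20            # symmetric positive semi-definite stand-in for the FIM
g = rng.standard_normal(sum(sizes))
step_block = block_diag_solve(F, g, sizes)
step_two_level = two_level_solve(F, g, sizes)
```

In this sketch the coarse problem has only as many unknowns as there are layers, so the correction is cheap compared with the per-layer solves. Whether such a correction helps in practice is exactly the question the abstract addresses.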