Second-order methods for deep learning, such as KFAC, can be useful for neural network training. However, they are often memory-inefficient because their preconditioning Kronecker factors are dense, and numerically unstable in low-precision training because those factors require high-precision matrix inversion or decomposition. Consequently, such methods are not widely used for training large neural networks such as transformer-based models. We address these two issues by (i) formulating an inverse-free update of KFAC and (ii) imposing structure on each of the Kronecker factors, resulting in a method we term structured inverse-free natural gradient descent (SINGD). On large modern neural networks, we show that, in contrast to KFAC, SINGD is memory-efficient and numerically robust, and often outperforms AdamW even in half precision. Hence, our work closes a gap between first-order and second-order methods in modern low-precision training for large neural networks.
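To make the memory argument concrete: for a linear layer with d_in inputs and d_out outputs, KFAC keeps two Kronecker factors of size d_in x d_in and d_out x d_out. The following minimal sketch only counts stored entries. The function names and the layer size are hypothetical, and the diagonal case is one illustrative structure chosen for simplicity, not necessarily a structure used by SINGD.

```python
def factor_entries(dim: int, structure: str = "dense") -> int:
    """Number of stored entries for one dim x dim Kronecker factor.

    'dense' stores the full matrix; 'diagonal' is an illustrative
    structured alternative (an assumption, not taken from the paper).
    """
    if structure == "dense":
        return dim * dim
    if structure == "diagonal":
        return dim
    raise ValueError(f"unknown structure: {structure}")


def preconditioner_entries(d_in: int, d_out: int, structure: str = "dense") -> int:
    """Entries stored for both Kronecker factors of one linear layer."""
    return factor_entries(d_in, structure) + factor_entries(d_out, structure)


if __name__ == "__main__":
    d_in, d_out = 4096, 4096  # hypothetical layer size
    weights = d_in * d_out    # number of parameters in the weight matrix
    for structure in ("dense", "diagonal"):
        n = preconditioner_entries(d_in, d_out, structure)
        print(f"{structure}: {n} entries ({n / weights:.4f}x the weight count)")
```

For a square layer, the dense factors together hold twice as many entries as the weight matrix itself, whereas the diagonal factors add only d_in + d_out entries.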