We rigorously study the joint evolution of training dynamics under stochastic gradient descent (SGD) and the spectra of the empirical Hessian and gradient matrices. We prove that, in two canonical classification tasks for multi-class high-dimensional mixtures with one- or two-layer neural networks, the SGD trajectory rapidly aligns with emerging low-rank outlier eigenspaces of the Hessian and gradient matrices. Moreover, in multi-layer settings this alignment occurs per layer, with the final layer's outlier eigenspace evolving over the course of training and exhibiting rank deficiency when SGD converges to sub-optimal classifiers. This establishes some of the rich predictions that have arisen from extensive numerical studies over the last decade about the spectra of Hessian and information matrices over the course of training in overparametrized networks.
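The alignment phenomenon described above can be illustrated numerically with a small sketch. The setup below is an assumption made for illustration and is not the setting analyzed in the paper: it uses a binary (rather than multi-class) Gaussian mixture, a single-layer logistic classifier with ridge regularization, online SGD with one fresh sample per step, and hand-picked values for the dimension, step size, noise level, and regularization. The helper names (`sample`, `spectral_alignment`, `run`) are hypothetical. The sketch measures the cosine between the weight vector and the top eigenvector of the empirical Hessian and of the empirical gradient second-moment matrix, before and after training.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sample(rng, n, mu, noise_std):
    """Draw n points x = s * mu + noise with s = +/-1; the label is y = (s + 1) / 2."""
    s = rng.choice([-1.0, 1.0], size=n)
    x = s[:, None] * mu[None, :] + noise_std * rng.standard_normal((n, mu.size))
    return x, (s + 1.0) / 2.0

def top_eigvec(mat):
    # eigh returns eigenvalues in ascending order, so the last column is the top eigenvector.
    _, vecs = np.linalg.eigh(mat)
    return vecs[:, -1]

def alignment(w, v):
    # Absolute cosine between w and v (eigenvectors are defined only up to sign).
    return abs(w @ v) / (np.linalg.norm(w) * np.linalg.norm(v))

def spectral_alignment(w, x, y):
    """Alignment of w with the top eigenvector of the empirical Hessian and gradient matrix."""
    p = sigmoid(x @ w)
    # Empirical Hessian of the logistic loss: (1/n) sum_i p_i (1 - p_i) x_i x_i^T.
    hess = (x * (p * (1.0 - p))[:, None]).T @ x / len(y)
    # Empirical gradient second-moment matrix: (1/n) sum_i g_i g_i^T, g_i = (p_i - y_i) x_i.
    g = (p - y)[:, None] * x
    gmat = g.T @ g / len(y)
    return alignment(w, top_eigvec(hess)), alignment(w, top_eigvec(gmat))

def run(d=100, steps=3000, lr=0.05, ridge=0.1, noise_std=0.3, n_eval=4000, seed=0):
    # All default values here are illustrative assumptions, not taken from the paper.
    rng = np.random.default_rng(seed)
    mu = np.zeros(d)
    mu[0] = 1.0                                   # class mean direction (unit norm)
    w = rng.standard_normal(d) / np.sqrt(d)       # random initialization with norm about 1
    x_eval, y_eval = sample(rng, n_eval, mu, noise_std)
    before = spectral_alignment(w, x_eval, y_eval)
    for _ in range(steps):
        x, y = sample(rng, 1, mu, noise_std)      # online SGD: one fresh sample per step
        grad = (sigmoid(x[0] @ w) - y[0]) * x[0] + ridge * w
        w -= lr * grad
    after = spectral_alignment(w, x_eval, y_eval)
    return before, after

if __name__ == "__main__":
    before, after = run()
    print("alignment (Hessian, gradient matrix) at init:     ", before)
    print("alignment (Hessian, gradient matrix) after SGD:   ", after)
```

Under these assumptions, the alignment at a random initialization should be small (on the order of 1/sqrt(d)), while after training it should be close to 1 for both matrices. The ridge term keeps the weight norm bounded so that the mean direction remains a top outlier of the Hessian; without it, the logistic curvature can concentrate near the decision boundary and the picture in this toy model may change.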