In this work we explore the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD). As observed previously, long after performance has converged, networks continue to move through parameter space by a process of anomalous diffusion in which distance travelled grows as a power law in the number of gradient updates with a nontrivial exponent. We reveal an intricate interaction between the hyperparameters of optimization, the structure in the gradient noise, and the Hessian matrix at the end of training that explains this anomalous diffusion. To build this understanding, we first derive a continuous-time model for SGD with finite learning rates and batch sizes as an underdamped Langevin equation. We study this equation in the setting of linear regression, where we can derive exact, analytic expressions for the phase space dynamics of the parameters and their instantaneous velocities from initialization to stationarity. Using the Fokker-Planck equation, we show that the key ingredient driving these dynamics is not the original training loss, but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents, which cause oscillations in phase space. We identify qualitative and quantitative predictions of this theory in the dynamics of a ResNet-18 model trained on ImageNet. Through the lens of statistical physics, we uncover a mechanistic origin for the anomalous limiting dynamics of deep neural networks trained with SGD.
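As a rough illustration of the quantity discussed above, the following sketch runs minibatch SGD on a synthetic linear regression problem, starting from the least-squares solution so that performance has already converged, and records the squared displacement of the parameters as a function of the number of updates. It is a toy under stated assumptions, not the setup used in this work: the data are Gaussian, there is no momentum or weight decay, and the function names (`sgd_displacement`, `fit_exponent`) and all hyperparameter values are hypothetical choices made for illustration. On a strictly convex problem like this one the displacement eventually saturates, so a fitted exponent depends on the time window chosen.

```python
import numpy as np


def sgd_displacement(n_steps=2000, n_samples=256, dim=10,
                     batch_size=16, lr=0.05, seed=0):
    """Squared parameter displacement under minibatch SGD on linear regression.

    All sizes and the learning rate are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, dim))
    w_true = rng.normal(size=dim)
    # Label noise keeps the minibatch gradient nonzero at the optimum.
    y = X @ w_true + 0.5 * rng.normal(size=n_samples)

    # Start at the full-batch least-squares solution ("after convergence").
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    w_start = w.copy()

    sq_disp = np.empty(n_steps)
    for t in range(n_steps):
        idx = rng.choice(n_samples, size=batch_size, replace=False)
        residual = X[idx] @ w - y[idx]
        grad = X[idx].T @ residual / batch_size  # minibatch gradient of 0.5 * MSE
        w = w - lr * grad
        sq_disp[t] = np.sum((w - w_start) ** 2)
    return sq_disp


def fit_exponent(sq_disp, t_min=1, t_max=None):
    """Slope of log(squared displacement) against log(step) over a window."""
    steps = np.arange(1, len(sq_disp) + 1)
    t_max = len(sq_disp) if t_max is None else t_max
    window = (steps >= t_min) & (steps <= t_max)
    slope, _intercept = np.polyfit(np.log(steps[window]),
                                   np.log(sq_disp[window]), 1)
    return float(slope)
```

Calling `fit_exponent(sgd_displacement(), t_max=50)` gives the log-log slope over the first 50 updates. Comparing slopes across windows, learning rates, and batch sizes is one simple way to probe how the hyperparameters shape the diffusion.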