We systematically analyze optimization dynamics in deep neural networks (DNNs) trained with stochastic gradient descent (SGD) over long time scales and study the effect of the learning rate, depth, and width of the network. By analyzing the maximum eigenvalue $\lambda^H_t$ of the Hessian of the loss, which is a measure of the sharpness of the loss landscape, we find that the dynamics can show four distinct regimes: (i) an early-time transient regime, (ii) an intermediate saturation regime, (iii) a progressive sharpening regime, and finally (iv) a late-time ``edge of stability'' regime. The early and intermediate regimes (i) and (ii) exhibit a rich phase diagram depending on learning rate $\eta \equiv c/\lambda^H_0$, depth $d$, and width $w$. We identify several critical values of $c$ that separate qualitatively distinct phenomena in the early-time dynamics of training loss and sharpness, and extract their dependence on $d/w$. Our results have implications for how to scale the learning rate with DNN depth and width in order to remain in the same phase of learning.
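To make the parametrization $\eta \equiv c/\lambda^H_0$ concrete, the following is a minimal sketch, not the procedure used in this work. It estimates the sharpness of a toy quadratic loss by power iteration on finite-difference Hessian-vector products, then sets the learning rate from a chosen $c$. The matrix `A`, the quadratic loss, and all function names (`sharpness`, `hvp`, `learning_rate`, `run_gd`) are assumptions introduced for illustration. In a real DNN one would typically use automatic differentiation for the Hessian-vector product instead of finite differences.

```python
import math
import random

# Hypothetical toy problem: L(theta) = 0.5 * theta^T A theta, whose Hessian is A.
A = [[2.0, 1.0], [1.0, 3.0]]


def loss(theta):
    """Toy quadratic loss 0.5 * theta^T A theta."""
    a_theta = grad(theta)
    return 0.5 * sum(t * g for t, g in zip(theta, a_theta))


def grad(theta):
    """Gradient of the toy loss, i.e. the matrix-vector product A @ theta."""
    return [sum(a * t for a, t in zip(row, theta)) for row in A]


def hvp(grad_fn, theta, v, eps=1e-4):
    """Hessian-vector product via central differences of the gradient."""
    plus = grad_fn([t + eps * x for t, x in zip(theta, v)])
    minus = grad_fn([t - eps * x for t, x in zip(theta, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]


def sharpness(grad_fn, theta, iters=200, seed=0):
    """Estimate the largest Hessian eigenvalue by power iteration.

    Power iteration finds the eigenvalue of largest magnitude; this equals
    the maximum eigenvalue when the Hessian is positive semidefinite,
    as it is for the toy loss here.
    """
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in theta]
    lam = 0.0
    for _ in range(iters):
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        hv = hvp(grad_fn, theta, v)
        lam = sum(a * b for a, b in zip(v, hv))  # Rayleigh quotient
        v = hv
    return lam


def learning_rate(c, lam0):
    """Learning rate eta = c / lambda_0^H from the abstract's definition."""
    return c / lam0


def run_gd(grad_fn, theta, eta, steps):
    """Plain full-batch gradient descent for a fixed number of steps."""
    for _ in range(steps):
        g = grad_fn(theta)
        theta = [t - eta * gi for t, gi in zip(theta, g)]
    return theta


theta0 = [1.0, 1.0]
lam0 = sharpness(grad, theta0)
for c in (1.9, 2.1):
    eta = learning_rate(c, lam0)
    final = run_gd(grad, theta0, eta, steps=100)
    print(f"c={c}: eta={eta:.4f}, loss {loss(theta0):.3f} -> {loss(final):.3e}")
```

For a purely quadratic loss, gradient descent converges only when $c < 2$, which is the textbook stability threshold rather than one of the critical values identified in this work. The sketch only illustrates why expressing the learning rate in units of $1/\lambda^H_0$ gives a scale that can be compared across models.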