We systematically analyze optimization dynamics in deep neural networks (DNNs) trained with stochastic gradient descent (SGD) and study the effects of the learning rate $\eta$, depth $d$, and width $w$ of the network. By analyzing the maximum eigenvalue $\lambda^H_t$ of the Hessian of the loss, which is a measure of the sharpness of the loss landscape, we find that the dynamics can show four distinct regimes: (i) an early time transient regime, (ii) an intermediate saturation regime, (iii) a progressive sharpening regime, and (iv) a late time ``edge of stability'' regime. The early and intermediate regimes (i) and (ii) exhibit a rich phase diagram depending on $\eta \equiv c / \lambda_0^H$, $d$, and $w$, where $\lambda_0^H$ is the sharpness at initialization. We identify several critical values of $c$ that separate qualitatively distinct phenomena in the early time dynamics of the training loss and sharpness. Notably, we discover the opening up of a ``sharpness reduction'' phase, in which sharpness decreases at early times, as $d$ and $1/w$ are increased.
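To make the parameterization $\eta \equiv c / \lambda_0^H$ concrete, the following is a minimal sketch of how the sharpness at initialization might be estimated and turned into a learning rate. Everything specific here is an assumption chosen for illustration and not taken from the work itself: the toy two-layer scalar linear network $f(x) = u v x$, the squared loss, the finite-difference Hessian-vector product, and the helper names `grad`, `hvp`, and `top_eigenvalue`. Power iteration converges to the eigenvalue of largest magnitude, so the sketch further assumes that this eigenvalue is the positive top one, which holds for the example point below.

```python
import math
import random


def grad(params, xs, ys):
    """Gradient of L(u, v) = mean(0.5 * (u*v*x - y)**2) for a toy net f(x) = u*v*x."""
    u, v = params
    n = len(xs)
    gu = sum((u * v * x - y) * v * x for x, y in zip(xs, ys)) / n
    gv = sum((u * v * x - y) * u * x for x, y in zip(xs, ys)) / n
    return [gu, gv]


def hvp(params, vec, xs, ys, eps=1e-4):
    """Approximate Hessian-vector product via central differences of the gradient."""
    plus = [p + eps * d for p, d in zip(params, vec)]
    minus = [p - eps * d for p, d in zip(params, vec)]
    g_plus, g_minus = grad(plus, xs, ys), grad(minus, xs, ys)
    return [(a - b) / (2.0 * eps) for a, b in zip(g_plus, g_minus)]


def top_eigenvalue(params, xs, ys, iters=200, seed=0):
    """Power iteration on the Hessian; returns the Rayleigh quotient estimate."""
    rng = random.Random(seed)
    vec = [rng.gauss(0.0, 1.0) for _ in params]
    norm = math.sqrt(sum(a * a for a in vec))
    vec = [a / norm for a in vec]
    lam = 0.0
    for _ in range(iters):
        hv = hvp(params, vec, xs, ys)
        lam = sum(a * b for a, b in zip(vec, hv))  # Rayleigh quotient v^T H v
        norm = math.sqrt(sum(a * a for a in hv))
        if norm == 0.0:
            break
        vec = [a / norm for a in hv]
    return lam


# Hypothetical initialization and data: one training point (x, y) = (1, 0).
xs, ys = [1.0], [0.0]
params0 = [1.0, 2.0]

lam0 = top_eigenvalue(params0, xs, ys)  # sharpness at initialization
c = 2.0                                 # illustrative value of the constant c
eta = c / lam0                          # learning rate eta = c / lambda_0^H
```

For this toy loss the Hessian at `params0` is $\begin{pmatrix} 4 & 4 \\ 4 & 1 \end{pmatrix}$, whose top eigenvalue is $(5 + \sqrt{73})/2$, so the estimate can be checked by hand. For a real network one would normally replace the finite-difference `hvp` with an automatic-differentiation Hessian-vector product.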