Recent experiments have shown that, when a neural network is trained by gradient descent (GD) with step size $\eta$, the operator norm of the Hessian of the loss often grows until it approximately reaches $2/\eta$, after which it fluctuates around this value. The quantity $2/\eta$ has been called the "edge of stability", a name motivated by a local quadratic approximation of the loss. We perform a similar calculation to arrive at an "edge of stability" for Sharpness-Aware Minimization (SAM), a variant of GD that has been shown to improve generalization. Unlike the GD edge, the resulting SAM-edge depends on the norm of the gradient. On three deep learning training tasks, we observe empirically that SAM operates on the edge of stability identified by this analysis.
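The quadratic-approximation argument can be illustrated on a one-dimensional toy loss $L(w) = \lambda w^2 / 2$, whose Hessian is the scalar $\lambda$. The sketch below is a minimal illustration, not the paper's code. All function names are hypothetical, and the SAM update is assumed to take the common normalized-ascent form with a perturbation radius `rho`, which the abstract does not specify. The expression in `sam_edge` is derived here for this toy model only, by solving $\eta\lambda + \eta\rho\lambda^2/\|g\| = 2$ for $\lambda$.

```python
import math


def gd_edge(eta):
    """Curvature at which a GD step on the toy quadratic stops contracting."""
    return 2.0 / eta


def gd_step(w, lam, eta):
    """One GD step on L(w) = lam * w**2 / 2, whose gradient is lam * w."""
    return w - eta * lam * w


def sam_step(w, lam, eta, rho):
    """One SAM step on the same quadratic (assumed normalized-ascent form).

    Requires w != 0 so that the gradient can be normalized.
    """
    g = lam * w
    w_adv = w + rho * g / abs(g)  # ascent step of length rho along the gradient
    return w - eta * lam * w_adv  # descend using the gradient at the perturbed point


def sam_edge(eta, rho, grad_norm):
    """Curvature at which a SAM step on the toy quadratic stops contracting.

    Positive root of eta*lam + eta*rho*lam**2/grad_norm = 2 (toy-model derivation).
    """
    return grad_norm / (2.0 * rho) * (math.sqrt(1.0 + 8.0 * rho / (eta * grad_norm)) - 1.0)


if __name__ == "__main__":
    eta, rho = 0.1, 0.05  # illustrative values only
    for grad_norm in (0.1, 1.0, 10.0, 1e6):
        print(grad_norm, sam_edge(eta, rho, grad_norm), gd_edge(eta))
```

In this toy setting the SAM threshold lies below $2/\eta$ and approaches it as the gradient norm grows relative to `rho`, which is consistent with the abstract's statement that the SAM-edge depends on the norm of the gradient.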