We consider Sharpness-Aware Minimization (SAM), a gradient-based optimization method for deep networks that has exhibited performance improvements on image and language prediction problems. We show that when SAM is applied with a convex quadratic objective, for most random initializations it converges to a cycle that oscillates between the two sides of the minimum along the direction of largest curvature, and we provide bounds on the rate of convergence. In the non-quadratic case, we show that such oscillations effectively perform gradient descent, with a smaller step size, on the spectral norm of the Hessian. In such cases, SAM's update may be regarded as a third derivative (the derivative of the Hessian in the leading eigenvector direction) that encourages drift toward wider minima.
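The quadratic-case behavior can be illustrated with a small numerical sketch. It assumes the normalized-gradient form of the SAM update, w ← w − η ∇L(w + ρ ∇L(w)/‖∇L(w)‖), which the abstract does not spell out, and a diagonal Hessian chosen purely for illustration. The function names, eigenvalues, step size η, and radius ρ are all hypothetical. Under this assumed update, restricting to the leading eigenvector direction with eigenvalue λ and solving w ↦ −w gives a two-cycle of amplitude ηρλ/(2 − ηλ); the sketch compares the simulated iterate against that value.

```python
import math


def sam_step(w, eigs, eta, rho):
    """One SAM step on L(w) = 0.5 * sum(eigs[i] * w[i]**2).

    Assumes the normalized-ascent form of SAM; the Hessian is diagonal
    with entries `eigs`, so the gradient is simply eigs[i] * w[i].
    """
    grad = [lam * wi for lam, wi in zip(eigs, w)]
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        # At the exact minimum the ascent direction is undefined; stay put.
        return list(w)
    # Ascent step of length rho along the normalized gradient.
    w_adv = [wi + rho * g / norm for wi, g in zip(w, grad)]
    # Descent step using the gradient evaluated at the perturbed point.
    grad_adv = [lam * wa for lam, wa in zip(eigs, w_adv)]
    return [wi - eta * g for wi, g in zip(w, grad_adv)]


def run_sam(w0, eigs, eta, rho, steps):
    """Iterate sam_step `steps` times starting from w0."""
    w = list(w0)
    for _ in range(steps):
        w = sam_step(w, eigs, eta, rho)
    return w


def cycle_amplitude(lam_max, eta, rho):
    """Two-cycle amplitude along the top eigenvector for the assumed update."""
    return eta * rho * lam_max / (2.0 - eta * lam_max)


# Illustrative (hypothetical) problem: curvatures 4, 1, 0.25.
eigs = [4.0, 1.0, 0.25]
eta, rho = 0.1, 0.05

w = run_sam([1.0, 1.0, 1.0], eigs, eta, rho, steps=5000)
w_next = sam_step(w, eigs, eta, rho)

print("iterate:            ", w)
print("next iterate:       ", w_next)
print("predicted amplitude:", cycle_amplitude(eigs[0], eta, rho))
```

With these settings the components along the two lower-curvature directions should decay toward zero while the first coordinate keeps flipping sign at a fixed magnitude, which is the "bouncing across the minimum" behavior described above. The sketch says nothing about the non-quadratic drift, which depends on third-derivative terms that vanish for a quadratic.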