We study the SAM (Sharpness-Aware Minimization) optimizer, which has recently attracted considerable interest due to its improved performance over more classical variants of stochastic gradient descent. Our main contribution is the derivation of continuous-time models (in the form of SDEs) for SAM and two of its variants, in both the full-batch and mini-batch settings. We demonstrate that these SDEs are rigorous approximations of the real discrete-time algorithms (in a weak sense, with an approximation error that scales linearly with the step size). Using these models, we then offer an explanation of why SAM prefers flat minima over sharp ones: we show that it minimizes an implicitly regularized loss with a Hessian-dependent noise structure. Finally, we prove that, perhaps unexpectedly, SAM is attracted to saddle points under some realistic conditions. Our theoretical results are supported by detailed experiments.
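For readers unfamiliar with the discrete-time algorithm that the SDEs approximate, the following is a minimal sketch of the commonly stated full-batch SAM update: take an ascent step of radius `rho` along the normalized gradient, then descend using the gradient evaluated at that perturbed point. It is an illustration under stated assumptions, not the paper's continuous-time model or its experimental setup. The toy quadratic loss with diagonal Hessian `h`, the names `loss`, `grad`, and `sam_step`, and the values of `lr`, `rho`, and `eps` are all hypothetical choices made for this example.

```python
import math


def loss(x, h):
    """Toy quadratic loss 0.5 * sum_i h_i * x_i^2 (assumed diagonal Hessian h)."""
    return 0.5 * sum(hi * xi * xi for hi, xi in zip(h, x))


def grad(x, h):
    """Gradient of the toy quadratic loss."""
    return [hi * xi for hi, xi in zip(h, x)]


def sam_step(x, h, lr=0.01, rho=0.05, eps=1e-12):
    """One full-batch SAM step on the toy loss (illustrative values for lr, rho)."""
    g = grad(x, h)
    norm = math.sqrt(sum(gi * gi for gi in g))
    # Ascent step: perturb the iterate along the normalized gradient.
    # eps guards against division by zero at a critical point.
    x_adv = [xi + rho * gi / (norm + eps) for xi, gi in zip(x, g)]
    # Descent step: use the gradient evaluated at the perturbed point.
    g_adv = grad(x_adv, h)
    return [xi - lr * gi for xi, gi in zip(x, g_adv)]


h = [1.0, 10.0]   # one flat and one sharp direction
x = [1.0, 1.0]
initial = loss(x, h)
for _ in range(200):
    x = sam_step(x, h)
final = loss(x, h)
print(initial, final)
```

Because the perturbation has fixed length `rho`, the iterates in this sketch do not settle exactly at the minimizer but hover in a small neighborhood of it, with the sharper direction receiving the larger extra push.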