Sharpness-Aware Minimization (SAM) is a recent optimization framework that aims to improve the generalization of deep neural networks by finding flatter (i.e., less sharp) solutions. As SAM has been successful in numerical experiments, recent papers have studied the theoretical aspects of the framework. In this work, we study SAM through the lens of implicit regularization and present a new theoretical explanation of why SAM generalizes well. To this end, we study the least-squares linear regression problem and show a bias-variance trade-off for SAM's error over the course of the algorithm. We show that SAM has lower bias than Gradient Descent (GD), while having higher variance. This shows that SAM can outperform GD, especially if the algorithm is \emph{stopped early}, which is often the case when training large neural networks because of the prohibitive computational cost. We extend our results to kernel regression and to stochastic optimization, and discuss how the implicit regularization of SAM can improve upon vanilla training.
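To make the setting concrete, the following is a minimal sketch of GD and a SAM-style update on the least-squares loss $f(w) = \frac{1}{2n}\|Xw - y\|^2$. It is an illustration under stated assumptions, not the exact algorithm analyzed in this work: the function names (`grad`, `gd_step`, `sam_step`, `run`), the zero initialization, and the step sizes are hypothetical, and the ascent step is left unnormalized for simplicity, whereas the widely used SAM formulation rescales the perturbation by the gradient norm.

```python
import numpy as np


def grad(w, X, y):
    """Gradient of the least-squares loss f(w) = ||Xw - y||^2 / (2n)."""
    return X.T @ (X @ w - y) / X.shape[0]


def gd_step(w, X, y, lr):
    """One plain gradient-descent step."""
    return w - lr * grad(w, X, y)


def sam_step(w, X, y, lr, rho):
    """One SAM-style step (assumption: unnormalized ascent perturbation)."""
    # Ascent step: move to a nearby point where the loss is higher.
    w_adv = w + rho * grad(w, X, y)
    # Descent step: update the original iterate with the gradient at that point.
    return w - lr * grad(w_adv, X, y)


def run(step, X, y, n_steps, **kwargs):
    """Iterate `step` from w = 0 and return all iterates, shape (n_steps + 1, d)."""
    w = np.zeros(X.shape[1])
    path = [w]
    for _ in range(n_steps):
        w = step(w, X, y, **kwargs)
        path.append(w)
    return np.array(path)


if __name__ == "__main__":
    # Synthetic data (hypothetical sizes and noise level, for illustration only).
    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 5))
    w_true = rng.standard_normal(5)
    y = X @ w_true + 0.1 * rng.standard_normal(50)

    gd_path = run(gd_step, X, y, n_steps=20, lr=0.1)
    sam_path = run(sam_step, X, y, n_steps=20, lr=0.1, rho=0.1)

    # Distance to the ground-truth weights along each trajectory.
    print(np.linalg.norm(gd_path - w_true, axis=1))
    print(np.linalg.norm(sam_path - w_true, axis=1))
```

With `rho=0.0` the SAM-style step reduces to the GD step, so the two trajectories coincide. Comparing the two error paths at small iteration counts is one way to explore the early-stopping regime discussed above, though a single synthetic run does not by itself establish the bias-variance claims.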