Sharpness-Aware Minimization (SAM) is a recent optimization framework that aims to improve the generalization of deep neural networks by finding flatter (i.e., less sharp) solutions. As SAM has been successful in numerical experiments, recent papers have studied the theoretical aspects of the framework. In this work, we study SAM through an implicit regularization lens and present a new theoretical explanation of why SAM generalizes well. To this end, we study the least-squares linear regression problem and show a bias-variance trade-off for SAM's error over the course of the algorithm. We show that SAM has lower bias than Gradient Descent (GD) but higher variance. This shows that SAM can outperform GD, especially if the algorithm is \emph{stopped early}, which is often the case when training large neural networks because of the prohibitive computational cost. We extend our results to kernel regression and to stochastic optimization, and discuss how the implicit regularization of SAM can improve upon vanilla training.
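The bias-variance comparison between SAM and GD on least-squares regression can be illustrated numerically. The following is a minimal sketch, not the paper's experimental setup: it assumes an unnormalized SAM step, $w_{t+1} = w_t - \eta \nabla L(w_t + \rho \nabla L(w_t))$, which the abstract does not specify, and the names `grad`, `run`, and `bias_variance`, the problem sizes, and the values of `rho` and `lr` are all hypothetical choices made for illustration. Because the iterates start at zero and are linear in the targets, the signal part (bias) and the noise part (variance) can be simulated separately.

```python
import numpy as np

def grad(w, X, y):
    """Gradient of the least-squares loss (1 / 2n) * ||X w - y||^2."""
    return X.T @ (X @ w - y) / X.shape[0]

def run(X, y, lr, rho, steps):
    """Return all iterates, shape (steps, d). rho = 0 recovers plain GD."""
    w = np.zeros(X.shape[1])
    path = []
    for _ in range(steps):
        # Assumed unnormalized SAM step: evaluate the gradient at a point
        # perturbed in the ascent direction, then descend from w.
        w_perturbed = w + rho * grad(w, X, y)
        w = w - lr * grad(w_perturbed, X, y)
        path.append(w.copy())
    return np.array(path)

def bias_variance(X, w_star, lr, rho, steps, noise_std=1.0, n_trials=200, seed=0):
    """Squared bias and Monte Carlo variance of the iterate at every step."""
    rng = np.random.default_rng(seed)
    # Noiseless targets isolate the bias: distance of the mean iterate to w_star.
    bias_sq = np.sum((run(X, X @ w_star, lr, rho, steps) - w_star) ** 2, axis=1)
    # Pure-noise targets isolate the variance (the noise part has zero mean).
    var = np.zeros(steps)
    for _ in range(n_trials):
        eps = noise_std * rng.standard_normal(X.shape[0])
        var += np.sum(run(X, eps, lr, rho, steps) ** 2, axis=1)
    return bias_sq, var / n_trials

# Hypothetical toy problem: n samples, d features, Gaussian design.
rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)

rho = 0.5
lam_max = np.linalg.eigvalsh(X.T @ X / n).max()
# Step size small enough that both GD and the assumed SAM step are stable.
lr = 0.5 / (lam_max * (1 + rho * lam_max))
steps = 20

bias_gd, var_gd = bias_variance(X, w_star, lr, 0.0, steps)
bias_sam, var_sam = bias_variance(X, w_star, lr, rho, steps)

for t in (0, 4, 19):
    print(f"step {t + 1:2d}: GD  bias^2={bias_gd[t]:.4f} var={var_gd[t]:.4f} | "
          f"SAM bias^2={bias_sam[t]:.4f} var={var_sam[t]:.4f}")
```

Under these assumptions the SAM iterates contract toward the target faster in every eigendirection of the data covariance, which lowers the bias at a fixed step count while amplifying the fitted noise. This matches the qualitative trade-off described above, although the sketch says nothing about the normalized SAM variant or about neural networks.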