Sharpness-Aware Minimization (SAM) is a recent optimization framework that aims to improve the generalization of deep neural networks by finding flatter (i.e., less sharp) solutions. Because SAM has been empirically successful, recent papers have studied the theoretical aspects of the framework and shown that SAM solutions are indeed flat. However, the statistical properties of SAM have received limited theoretical attention. In this work, we directly study the statistical performance of SAM and present a new theoretical explanation of why SAM generalizes well. To this end, we study two statistical problems, neural networks with a hidden layer and kernel regression, and prove that, under certain conditions, SAM achieves smaller prediction error than Gradient Descent (GD). Our results cover both convex and non-convex settings and show that SAM is particularly well-suited to non-convex problems. Additionally, we prove that in our setup SAM solutions are also less sharp, which shows that our results agree with previous work. Our theoretical findings are validated by numerical experiments in numerous scenarios, including deep neural networks.
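For readers unfamiliar with the two update rules being compared, the following is a minimal sketch of one GD step and one SAM step on a toy least-squares problem. It assumes the commonly used normalized SAM update (perturb the weights along the normalized gradient by a radius `rho`, then descend using the gradient at the perturbed point); the abstract does not state which SAM variant the analysis uses, so this choice is an assumption. The data, the function names (`loss`, `grad`, `gd_step`, `sam_step`), and the values of `lr` and `rho` are all hypothetical and chosen only for illustration.

```python
import numpy as np


def loss(w, X, y):
    """Half mean squared error of a linear model."""
    r = X @ w - y
    return 0.5 * np.mean(r ** 2)


def grad(w, X, y):
    """Gradient of the half mean squared error with respect to w."""
    return X.T @ (X @ w - y) / len(y)


def gd_step(w, X, y, lr):
    """Plain gradient descent: step along the gradient at w."""
    return w - lr * grad(w, X, y)


def sam_step(w, X, y, lr, rho):
    """SAM (normalized variant, assumed here): step along the gradient
    evaluated at w perturbed in the ascent direction with radius rho."""
    g = grad(w, X, y)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # small constant avoids 0/0
    return w - lr * grad(w + eps, X, y)


# Hypothetical toy data: noisy linear regression in three dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

w_gd = np.zeros(3)
w_sam = np.zeros(3)
for _ in range(200):
    w_gd = gd_step(w_gd, X, y, lr=0.1)
    w_sam = sam_step(w_sam, X, y, lr=0.1, rho=0.05)

print("GD training loss: ", loss(w_gd, X, y))
print("SAM training loss:", loss(w_sam, X, y))
```

This convex toy problem only illustrates how the two updates differ mechanically; it is not meant to reproduce the prediction-error comparison described above. Setting `rho=0` makes `sam_step` coincide with `gd_step`.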