Sharpness-Aware Minimization (SAM) is a recently proposed gradient-based optimizer (Foret et al., ICLR 2021) that greatly improves the prediction performance of deep neural networks. Consequently, there has been a surge of interest in explaining its empirical success. We focus in particular on the role of normalization, a key component of the SAM update. We study the effect of normalization in SAM both theoretically and empirically, for convex and non-convex functions, and identify two key roles: (i) it helps stabilize the algorithm; and (ii) it enables the algorithm to drift along a continuum (manifold) of minima, a property that recent theoretical works identify as key to better performance. We further argue that these two properties make SAM robust to the choice of hyper-parameters, which supports its practicality. Our conclusions are backed by various experiments.
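To make the role of normalization concrete, here is a minimal sketch of one SAM step with and without normalization. It assumes the standard SAM update of Foret et al., in which the ascent perturbation is the gradient rescaled to length rho. The names `sam_step`, `grad_fn`, and `quadratic_grad`, the default values of `lr` and `rho`, and the toy quadratic objective are illustrative assumptions and are not taken from this work.

```python
import numpy as np


def sam_step(w, grad_fn, lr=0.1, rho=0.05, normalize=True, eps=1e-12):
    """One SAM step (hypothetical helper, for illustration only).

    normalize=True  : perturbation = rho * g / ||g||  (standard SAM)
    normalize=False : perturbation = rho * g          (un-normalized variant)
    """
    g = grad_fn(w)
    if normalize:
        # eps guards against division by zero at a stationary point (assumption).
        perturbation = rho * g / (np.linalg.norm(g) + eps)
    else:
        perturbation = rho * g
    # Descend using the gradient evaluated at the perturbed point.
    return w - lr * grad_fn(w + perturbation)


def quadratic_grad(w):
    # Gradient of the toy objective f(w) = 0.5 * ||w||^2 (assumed example).
    return w


w0 = np.array([3.0, -4.0])
w_normalized = sam_step(w0, quadratic_grad, normalize=True)
w_unnormalized = sam_step(w0, quadratic_grad, normalize=False)
print(w_normalized, w_unnormalized)
```

With normalization, the perturbation always has length rho regardless of the gradient's magnitude. Without it, the perturbation grows with the gradient norm. This difference in scaling is what the stability discussion concerns.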