Sharpness-Aware Minimization (SAM) is an optimizer that takes a descent step based on the gradient at a perturbation $y_t = x_t + \rho \frac{\nabla f(x_t)}{\lVert \nabla f(x_t) \rVert}$ of the current point $x_t$. Existing studies prove convergence of SAM for smooth functions, but they do so by assuming a decaying perturbation size $\rho$ and/or no gradient normalization in $y_t$, assumptions that are detached from practice. To address this gap, we study deterministic and stochastic versions of SAM with practical configurations (i.e., constant $\rho$ and gradient normalization in $y_t$) and explore their convergence properties on smooth functions under various (non)convexity assumptions. Perhaps surprisingly, we find that in many scenarios SAM has limited capability to converge to global minima or stationary points. For smooth strongly convex functions, we show that while deterministic SAM enjoys tight global convergence rates of $\tilde \Theta(\frac{1}{T^2})$, the convergence bound of stochastic SAM suffers from an unavoidable additive term $O(\rho^2)$, indicating convergence only up to neighborhoods of optima. In fact, such $O(\rho^2)$ terms arise for stochastic SAM in all the settings we consider, and also for deterministic SAM in nonconvex cases; importantly, we prove by examples that such terms are unavoidable. Our results highlight vastly different characteristics of SAM with vs. without decaying perturbation size or gradient normalization, and suggest that the intuitions gained from one version may not apply to the other.
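To make the practical configuration concrete, the following is a minimal sketch of one deterministic SAM step with constant $\rho$ and gradient normalization in $y_t$. The descent rule $x_{t+1} = x_t - \eta \nabla f(y_t)$, the step size name `eta`, the helper names `sam_step` and `grad_quadratic`, the toy objective $f(x) = \frac{1}{2}\lVert x \rVert^2$, and the handling of a zero gradient are all assumptions made for illustration; they are not taken from the analysis itself.

```python
import math


def sam_step(x, grad_f, rho, eta):
    """One deterministic SAM step (illustrative sketch).

    x      : current point x_t, as a list of floats
    grad_f : function returning the gradient of f at a point
    rho    : constant perturbation size
    eta    : step size (hypothetical name, not from the source)
    """
    g = grad_f(x)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        # The normalized perturbation is undefined when the gradient is zero.
        # Leaving x unchanged here is an assumption of this sketch.
        return list(x)
    # Perturbed point y_t = x_t + rho * grad f(x_t) / ||grad f(x_t)||
    y = [xi + rho * gi / norm for xi, gi in zip(x, g)]
    # Descent step using the gradient evaluated at y_t
    g_y = grad_f(y)
    return [xi - eta * gi for xi, gi in zip(x, g_y)]


def grad_quadratic(x):
    # Gradient of the toy objective f(x) = 0.5 * ||x||^2
    return list(x)


x0 = [3.0, 4.0]
x1 = sam_step(x0, grad_quadratic, rho=0.5, eta=0.1)
print(x1)
```

Here the gradient at `x0` has norm 5, so the perturbed point is `[3.3, 4.4]`, and the step moves `x0` to approximately `[2.67, 3.56]`. Because the perturbation has fixed length `rho` regardless of how small the gradient is, the point at which the gradient is evaluated never approaches $x_t$ itself, which is the feature of the practical configuration that the results above concern.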