Sharpness-Aware Minimization (SAM) is an optimizer that takes a descent step using the gradient evaluated at a perturbed point $y_t = x_t + \rho \frac{\nabla f(x_t)}{\lVert \nabla f(x_t) \rVert}$ rather than at the current point $x_t$. Existing studies prove convergence of SAM for smooth functions, but they do so by assuming a decaying perturbation size $\rho$ and/or no gradient normalization in $y_t$, both of which are detached from practice. To address this gap, we study deterministic and stochastic versions of SAM with practical configurations (i.e., constant $\rho$ and gradient normalization in $y_t$) and explore their convergence properties on smooth functions under convexity and nonconvexity assumptions. Perhaps surprisingly, we find that in many scenarios SAM has limited capability to converge to global minima or stationary points. For smooth strongly convex functions, we show that while deterministic SAM enjoys tight global convergence rates of $\tilde \Theta(\frac{1}{T^2})$, the convergence bound of stochastic SAM suffers an inevitable additive term $O(\rho^2)$, indicating convergence only up to neighborhoods of optima. In fact, such $O(\rho^2)$ terms arise for stochastic SAM in all the settings we consider, and also for deterministic SAM in nonconvex cases; importantly, we prove by examples that such terms are unavoidable. Our results highlight vastly different characteristics of SAM with vs. without decaying perturbation size or gradient normalization, and suggest that the intuitions gained from one version may not apply to the other.
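The update studied here (constant $\rho$, normalized gradient in $y_t$) can be illustrated with a minimal sketch of one deterministic step. The function names `sam_step` and `quad_grad`, the step size `eta`, the quadratic test function, and the handling of a zero gradient are all assumptions made for illustration; they are not taken from the paper.

```python
import math


def sam_step(x, grad_fn, eta, rho):
    """One deterministic SAM step with constant rho and gradient normalization.

    x is a list of floats, grad_fn maps a point to its gradient (a list),
    eta is the step size, rho is the perturbation size.
    """
    g = grad_fn(x)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        # Assumption: the normalized direction is undefined at an exact
        # stationary point, so this sketch applies no perturbation there.
        y = list(x)
    else:
        # y_t = x_t + rho * grad f(x_t) / ||grad f(x_t)||
        y = [xi + rho * gi / norm for xi, gi in zip(x, g)]
    # Descent step from x_t using the gradient evaluated at y_t.
    g_y = grad_fn(y)
    return [xi - eta * gi for xi, gi in zip(x, g_y)]


def quad_grad(x):
    """Gradient of the illustrative function f(x) = 0.5 * ||x||^2."""
    return list(x)


x_next = sam_step([3.0, 4.0], quad_grad, eta=0.1, rho=0.5)
print(x_next)
```

For this input the gradient has norm 5, so the perturbed point is $(3.3, 4.4)$ and the step lands at approximately $(2.67, 3.56)$. A stochastic variant would replace `grad_fn` with a noisy gradient oracle; whether the same sample is used at $x_t$ and $y_t$ is a design choice this sketch leaves open.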