Sharpness-Aware Minimization (SAM) is an optimizer that takes a descent step based on the gradient at a perturbation $y_t = x_t + \rho \frac{\nabla f(x_t)}{\lVert \nabla f(x_t) \rVert}$ of the current point $x_t$. Existing studies prove convergence of SAM for smooth functions, but they do so by assuming a decaying perturbation size $\rho$ and/or no gradient normalization in $y_t$, assumptions that are detached from practice. To address this gap, we study deterministic and stochastic versions of SAM with practical configurations (i.e., constant $\rho$ and gradient normalization in $y_t$) and explore their convergence properties on smooth functions with (non)convexity assumptions. Perhaps surprisingly, we find that in many scenarios SAM has limited capability to converge to global minima or stationary points. For smooth strongly convex functions, we show that while deterministic SAM enjoys tight global convergence rates of $\tilde \Theta(\frac{1}{T^2})$, the convergence bound of stochastic SAM suffers from an unavoidable additive term $O(\rho^2)$, indicating convergence only up to neighborhoods of optima. In fact, such $O(\rho^2)$ terms arise for stochastic SAM in all the settings we consider, and also for deterministic SAM in nonconvex cases; importantly, we prove by examples that such terms are unavoidable. Our results highlight vastly different characteristics of SAM with vs. without decaying perturbation size or gradient normalization, and suggest that the intuitions gained from one version may not apply to the other.
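To make the update rule concrete, the following is a minimal sketch of one deterministic SAM step with constant $\rho$ and gradient normalization, as described above. The function names (`sam_step`, `quad_grad`), the step size `eta`, the test function $f(x) = \frac{1}{2}\lVert x \rVert^2$, and the choice to leave the iterate unchanged when the gradient is exactly zero are all illustrative assumptions, not details taken from the paper.

```python
import math


def sam_step(x, grad_f, rho, eta):
    """One deterministic SAM step (illustrative sketch).

    x      : current point x_t, as a list of floats
    grad_f : callable returning the gradient of f at a point
    rho    : constant perturbation size
    eta    : step size (assumed name; not specified in the abstract)
    """
    g = grad_f(x)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        # Assumption: the normalized perturbation is undefined at a
        # stationary point, so this sketch simply keeps the iterate.
        return list(x)
    # Perturbed point y_t = x_t + rho * grad f(x_t) / ||grad f(x_t)||
    y = [xi + rho * gi / norm for xi, gi in zip(x, g)]
    # Descent step from x_t using the gradient evaluated at y_t
    g_y = grad_f(y)
    return [xi - eta * gyi for xi, gyi in zip(x, g_y)]


def quad_grad(x):
    """Gradient of the assumed test function f(x) = 0.5 * ||x||^2."""
    return list(x)


if __name__ == "__main__":
    x_next = sam_step([3.0, 4.0], quad_grad, rho=0.5, eta=0.1)
    print(x_next)
```

For the starting point $(3, 4)$ the gradient has norm 5, so the perturbed point is $(3.3, 4.4)$ and the step with `eta=0.1` lands at approximately $(2.67, 3.56)$. A stochastic variant would replace `grad_f` with a noisy gradient oracle; how the oracle is sampled at $x_t$ and $y_t$ is a design choice this sketch does not cover.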