Sharpness-Aware Minimization (SAM) is widely used to seek flatter minima, which are often linked to better generalization. In its standard implementation, SAM updates the current iterate using the loss gradient evaluated at a point perturbed by distance $\rho$ along the normalized gradient direction. We show that, for some choices of $\rho$, SAM can stall at points where this shifted (perturbed-point) gradient vanishes even though the original gradient does not; such points are therefore not stationary points of the original loss. We call these points hallucinated minimizers, prove their existence under simple nonconvex landscape conditions (e.g., the presence of a local minimizer and a local maximizer), and establish sufficient conditions for local convergence of the SAM iterates to them. We corroborate this failure mode in neural network training and observe that it aligns with the performance degradation of SAM often seen at large $\rho$. Finally, as a practical safeguard, we find that a short initial SGD warm-start before enabling SAM mitigates this failure mode and reduces sensitivity to the choice of $\rho$.
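The stall mechanism described above can be illustrated with a minimal one-dimensional sketch (our own illustrative construction, not the paper's experiments): take $f(x) = \sin x$, whose gradient $f'(x) = \cos x$ vanishes at the local maximizer $x = \pi/2$. With perturbation radius $\rho$, the point $x^\star = \pi/2 - \rho$ has a clearly nonzero gradient, yet the perturbed-point gradient that SAM actually uses is (numerically) zero, so the SAM update leaves $x^\star$ fixed:

```python
import math

def grad(x):
    """Gradient of f(x) = sin(x)."""
    return math.cos(x)

def sam_step(x, rho, lr):
    """One SAM step: descend using the gradient at the perturbed point."""
    g = grad(x)
    # In 1-D, the normalized gradient direction is just the sign of g.
    x_adv = x + rho * math.copysign(1.0, g)  # ascent step of length rho
    return x - lr * grad(x_adv)              # update with perturbed gradient

rho, lr = 0.5, 0.1
x = math.pi / 2 - rho                  # candidate hallucinated minimizer
print(abs(grad(x)))                    # original gradient: about 0.479, nonzero
print(abs(grad(x + rho)))              # perturbed gradient: ~0 (cos(pi/2))
print(abs(sam_step(x, rho, lr) - x))   # SAM update leaves x essentially fixed
```

Here the perturbed point sits exactly on a critical point of the loss, so the iterate stalls away from any true stationary point, matching the definition of a hallucinated minimizer; the values of $\rho$ and the learning rate are arbitrary choices for the sketch.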