Sharpness-aware minimization (SAM) has emerged as a highly effective technique for improving model generalization, but its underlying principles are not fully understood. We investigate m-sharpness, the phenomenon in which SAM's performance improves monotonically as the micro-batch size m used to compute perturbations decreases; this behavior is critical for distributed training yet lacks a rigorous explanation. We leverage an extended Stochastic Differential Equation (SDE) framework and analyze stochastic gradient noise (SGN) to characterize the dynamics of SAM variants, including n-SAM and m-SAM. Our analysis reveals that stochastic perturbations induce an implicit variance-based sharpness regularization whose strength increases as m decreases. Motivated by this insight, we propose Reweighted SAM (RW-SAM), which employs sharpness-weighted sampling to mimic the generalization benefits of m-SAM while remaining parallelizable. Comprehensive experiments validate our theory and method.
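To make the m-SAM computation concrete, the following is a minimal NumPy sketch of the m-sharpness gradient on a toy least-squares loss: the batch is split into micro-batches of size m, each micro-batch is perturbed independently along its own normalized gradient, and the perturbed gradients are averaged. Function names and the choice of loss are illustrative assumptions, not the paper's implementation; at m equal to the full batch size this reduces to ordinary SAM.

```python
import numpy as np

def grad_loss(w, X, y):
    # Gradient of the mean squared-error loss 0.5 * mean((Xw - y)^2)
    # over a (micro-)batch (illustrative toy loss).
    r = X @ w - y
    return X.T @ r / len(y)

def m_sam_grad(w, X, y, rho=0.05, m=1):
    # m-SAM: split the batch into micro-batches of size m, perturb each
    # independently by rho * g_i / ||g_i||, then average the gradients
    # evaluated at the perturbed points.
    grads = []
    for s in range(0, len(y), m):
        Xi, yi = X[s:s+m], y[s:s+m]
        g = grad_loss(w, Xi, yi)
        eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent perturbation
        grads.append(grad_loss(w + eps, Xi, yi))
    return np.mean(grads, axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = np.zeros(3)

g_full = m_sam_grad(w, X, y, m=8)   # one micro-batch: ordinary SAM
g_micro = m_sam_grad(w, X, y, m=1)  # m = 1: strongest implicit regularization
```

Decreasing m here increases the variance of the per-micro-batch perturbations, which is the source of the implicit variance-based sharpness regularization analyzed in the paper.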