Sharpness-aware minimization (SAM) has emerged as a highly effective technique for improving model generalization, but its underlying principles are not fully understood. We investigate the phenomenon known as m-sharpness, in which the performance of SAM improves monotonically as the micro-batch size used to compute perturbations decreases. In practice, the empirical m-sharpness effect underpins the deployment of SAM in distributed training, yet a rigorous theoretical account has been lacking. To provide a theoretical explanation for m-sharpness, we leverage an extended Stochastic Differential Equation (SDE) framework and analyze the structure of stochastic gradient noise (SGN) to characterize the dynamics of various SAM variants, including n-SAM and m-SAM. Our findings reveal that the stochastic noise introduced during SAM perturbations inherently induces a variance-based sharpness regularization effect. Motivated by these theoretical insights, we introduce Reweighted SAM (RW-SAM), which employs sharpness-weighted sampling to mimic the generalization benefits of m-SAM while remaining parallelizable. Comprehensive experiments validate the effectiveness of our theoretical analysis and proposed method.
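To make the n-SAM vs. m-SAM distinction concrete, the following is a minimal sketch of the two update rules on a toy least-squares objective. All hyperparameters (`rho`, `lr`, the micro-batch size `m`) and the quadratic loss are illustrative assumptions, not the paper's experimental setup: n-SAM computes a single perturbation from the full batch, while m-SAM computes an independent perturbation per micro-batch and averages the resulting gradients.

```python
import numpy as np

# Toy data: linear regression with Gaussian features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=32)

def grad(w, Xb, yb):
    """Gradient of the mean-squared-error loss on a (micro-)batch."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

def n_sam_step(w, Xb, yb, rho=0.05, lr=0.1):
    """n-SAM: one perturbation computed from the whole batch."""
    g = grad(w, Xb, yb)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent step toward a nearby sharp point
    return w - lr * grad(w + eps, Xb, yb)        # descend from the perturbed point

def m_sam_step(w, Xb, yb, m=8, rho=0.05, lr=0.1):
    """m-SAM: each micro-batch of size m gets its own perturbation;
    the perturbed gradients are then averaged."""
    gs = []
    for i in range(0, len(yb), m):
        Xm, ym = Xb[i:i + m], yb[i:i + m]
        g = grad(w, Xm, ym)
        eps = rho * g / (np.linalg.norm(g) + 1e-12)
        gs.append(grad(w + eps, Xm, ym))
    return w - lr * np.mean(gs, axis=0)

w = np.zeros(5)
for _ in range(50):
    w = m_sam_step(w, X, y)
final_loss = np.mean((X @ w - y) ** 2)
```

Because each micro-batch perturbation is computed independently, the per-micro-batch gradients in `m_sam_step` can be evaluated in parallel across workers, which is the distributed-training setting the abstract refers to.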