The challenge of overfitting, in which the model memorizes the training data and fails to generalize to test data, has become increasingly significant in the training of large neural networks. To tackle this challenge, Sharpness-Aware Minimization (SAM) has emerged as a promising training method that can improve the generalization of neural networks even in the presence of label noise. However, a deep understanding of how SAM works, especially in the setting of nonlinear neural networks and classification tasks, remains largely missing. This paper fills this gap by demonstrating why SAM generalizes better than Stochastic Gradient Descent (SGD) for a certain data model and two-layer convolutional ReLU networks. The loss landscape of the problem we study is nonsmooth, so existing explanations of SAM's success that rely on Hessian information are insufficient. Our result explains the benefits of SAM, particularly its ability to prevent noise learning in the early stages of training, thereby facilitating more effective learning of features. Experiments on both synthetic and real data corroborate our theory.
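For readers unfamiliar with the update rule being compared against SGD, the following is a minimal sketch of one SAM step in its commonly used form: take the gradient at the current weights, move a distance `rho` along the normalized gradient, and then descend using the gradient evaluated at that perturbed point. The abstract itself does not spell out the update, so the function names (`sgd_step`, `sam_step`, `quadratic_grad`), the hyperparameter values, and the toy quadratic loss are all illustrative assumptions; they are not the data model or the two-layer convolutional ReLU network analyzed in the paper.

```python
import numpy as np


def sgd_step(w, grad_fn, lr):
    """Plain gradient step: w <- w - lr * grad(w)."""
    return w - lr * grad_fn(w)


def sam_step(w, grad_fn, lr, rho, eps=1e-12):
    """One SAM step (sketch of the commonly used formulation).

    rho is the radius of the perturbation; eps only guards against
    division by zero when the gradient vanishes.
    """
    g = grad_fn(w)
    # Ascent direction: rescale the gradient to (approximately) length rho.
    perturbation = rho * g / (np.linalg.norm(g) + eps)
    # Descend using the gradient evaluated at the perturbed weights.
    g_perturbed = grad_fn(w + perturbation)
    return w - lr * g_perturbed


def quadratic_grad(w):
    """Gradient of the toy loss 0.5 * ||w||^2 (an assumption for illustration)."""
    return w


w0 = np.array([3.0, 4.0])
w_sgd = sgd_step(w0, quadratic_grad, lr=0.1)
w_sam = sam_step(w0, quadratic_grad, lr=0.1, rho=0.5)
print("SGD:", w_sgd)
print("SAM:", w_sam)
```

On this smooth toy loss the two updates differ only by a small extra shrinkage toward the origin. The setting studied in the paper is nonsmooth, so this example illustrates the mechanics of the update only, not the noise-suppression effect the paper proves.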