Training an overparameterized neural network can yield minimizers that attain the same training loss yet differ in generalization capability. Motivated by evidence of a correlation between the sharpness of minima and their generalization errors, increasing effort has gone into developing optimization methods that explicitly seek flat minima as more generalizable solutions. How overparameterization affects this sharpness-aware minimization (SAM) strategy, however, has received little study. In this work, we analyze SAM under varying degrees of overparameterization and present both empirical and theoretical results that suggest a critical influence of overparameterization on SAM. Specifically, we first use standard techniques in optimization to prove that SAM can achieve a linear convergence rate under overparameterization in a stochastic setting. We also show that the linearly stable minima found by SAM are indeed flatter and have more uniformly distributed Hessian moments than those found by SGD. These results are corroborated by our experiments, which reveal a consistent trend: the generalization improvement made by SAM continues to increase as the model becomes more overparameterized. We further show that sparsity can open up an avenue for effective overparameterization in practice.
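The abstract does not spell out the SAM update itself. As a rough illustration only, the sketch below follows the commonly described two-step form: perturb the weights by a radius `rho` along the normalized gradient, then take the descent step using the gradient evaluated at that perturbed point. The toy quadratic loss, the `CURVATURE` values, and the names `sam_step`, `lr`, and `rho` are assumptions made for this example and are not taken from the paper.

```python
import math

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM-style update (illustrative sketch, not the authors' code)."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # stationary point: no ascent direction to follow
    # First-order approximation of the worst-case point within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descend from the ORIGINAL weights using the gradient at the perturbed point.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

# Hypothetical toy loss: L(w) = 0.5 * sum(a_i * w_i^2), one sharp and one flat direction.
CURVATURE = [10.0, 1.0]

def loss(w):
    return 0.5 * sum(a * wi * wi for a, wi in zip(CURVATURE, w))

def grad(w):
    return [a * wi for a, wi in zip(CURVATURE, w)]

w = [1.0, 1.0]
for _ in range(50):
    w = sam_step(w, grad, lr=0.05, rho=0.05)
print(loss(w))
```

With a fixed `rho`, the iterates on this toy problem settle into a small neighborhood of the minimum rather than converging exactly, since the perturbation does not shrink as the gradient does. Setting `rho=0` recovers plain gradient descent.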