Despite extensive study, the underlying reason why overparameterized neural networks generalize remains elusive. Existing theory shows that common stochastic optimizers prefer flatter minimizers of the training loss, so a natural candidate explanation is that flatness implies generalization. This work critically examines this explanation. Through theoretical and empirical investigation, we identify the following three scenarios for two-layer ReLU networks: (1) flatness provably implies generalization; (2) there exist non-generalizing flattest models, and sharpness minimization algorithms fail to generalize; and (3) perhaps most surprisingly, there exist non-generalizing flattest models, but sharpness minimization algorithms still generalize. Our results suggest that the relationship between sharpness and generalization depends subtly on the data distribution and the model architecture, and that sharpness minimization algorithms do not only minimize sharpness to achieve better generalization. This calls for a search for other explanations of the generalization of overparameterized neural networks.