Max-Margin Works while Large Margin Fails: Generalization without Uniform Convergence

A major challenge in modern machine learning is theoretically understanding the generalization properties of overparameterized models. Many existing tools rely on uniform convergence (UC), a property that, when it holds, guarantees that the test loss will be close to the training loss, uniformly over a class of candidate models. Nagarajan and Kolter (2019) show that in certain simple linear and neural-network settings, any uniform convergence bound will be vacuous, leaving open the question of how to prove generalization in settings where UC fails. Our main contribution is proving novel generalization bounds in two such settings, one linear and one non-linear. We study the linear classification setting of Nagarajan and Kolter, and a quadratic ground truth function learned via a two-layer neural network in the non-linear regime. We prove a new type of margin bound showing that above a certain signal-to-noise threshold, any near-max-margin classifier will achieve almost no test loss in these two settings. Our results show that being near-max-margin is essential: while any model that achieves at least a $(1 - \epsilon)$-fraction of the max-margin generalizes well, a classifier achieving half of the max-margin may fail terribly. Building on the impossibility results of Nagarajan and Kolter, under slightly stronger assumptions, we show that one-sided UC bounds and classical margin bounds will fail on near-max-margin classifiers. Our analysis provides insight into why memorization can coexist with generalization: we show that in this challenging regime where generalization occurs but UC fails, near-max-margin classifiers simultaneously contain some generalizable components and some overfitting components that memorize the data. The presence of the overfitting components is enough to preclude UC, but the near-extremal margin guarantees that sufficient generalizable components are present.
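To make the $(1 - \epsilon)$-fraction condition concrete, the following is a minimal sketch for a generic linear classifier. It assumes the margin is the usual $\ell_2$-normalized one, $\min_i y_i \langle w, x_i \rangle / \|w\|_2$; the abstract does not state the exact normalization used in either setting, and the function names, toy data, and value of `eps` are hypothetical choices made only for illustration.

```python
import numpy as np

def normalized_margin(w, X, y):
    """Smallest value of y_i * <w, x_i> / ||w||_2 over the training set.

    Assumption: the l2-normalized margin of a linear classifier.
    """
    return float(np.min(y * (X @ w)) / np.linalg.norm(w))

def is_near_max_margin(w, X, y, max_margin, eps):
    """True if w attains at least a (1 - eps) fraction of max_margin."""
    return normalized_margin(w, X, y) >= (1.0 - eps) * max_margin

# Toy data (hypothetical): two points on the first axis, so the
# max-margin direction is (1, 0) and the max margin is 2.
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
max_margin = normalized_margin(np.array([1.0, 0.0]), X, y)

w_near = np.array([1.0, 0.1])  # slightly tilted: about 99.5% of the max margin
w_far = np.array([1.0, 1.0])   # tilted 45 degrees: about 70.7% of the max margin

near_ok = is_near_max_margin(w_near, X, y, max_margin, eps=0.1)
far_ok = is_near_max_margin(w_far, X, y, max_margin, eps=0.1)
```

Both `w_near` and `w_far` classify the toy training set perfectly, but only `w_near` meets the near-max-margin condition. This only illustrates the condition itself, not the generalization bounds.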
