We study the generalization of over-parameterized classifiers for which learning by Empirical Risk Minimization (ERM) yields zero training error. In such over-parameterized settings there are many global minima with zero training error, some of which generalize better than others. We show that, under certain conditions, the fraction of "bad" global minima with a true error larger than ε decays to zero exponentially fast with the number of training samples n. The bound depends on the distribution of the true error over the set of classifier functions used for the given classification problem, and does not necessarily depend on the size or complexity (e.g., the number of parameters) of that set. This might explain the unexpectedly good generalization of even highly over-parameterized neural networks. We support our mathematical framework with experiments on a synthetic data set and a subset of MNIST.
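The exponential decay can be illustrated with a small sketch. It is not the bound derived in the paper, only a toy calculation under stated assumptions: a finite, hypothetical set of classifiers with known true errors, and n training samples drawn i.i.d., so that a classifier with true error e reaches zero training error with probability (1 - e)^n. The function name `expected_bad_ratio` and the example error values are made up for illustration. The quantity computed is the ratio of the expected number of "bad" zero-training-error classifiers to the expected number of all zero-training-error classifiers.

```python
def expected_bad_ratio(true_errors, eps, n):
    """Ratio of expected counts: 'bad' zero-training-error classifiers
    (true error > eps) over all zero-training-error classifiers.

    Assumes n i.i.d. training samples, so a classifier with true error e
    fits all of them with probability (1 - e) ** n.
    """
    consistent = [(1.0 - e) ** n for e in true_errors]
    bad = sum(p for p, e in zip(consistent, true_errors) if e > eps)
    return bad / sum(consistent)


# Hypothetical function set (an assumption, not taken from the paper):
# 10,000 chance-level classifiers and only 10 good ones.
true_errors = [0.5] * 10_000 + [0.01] * 10

ratios = {n: expected_bad_ratio(true_errors, eps=0.1, n=n) for n in (0, 10, 20, 40)}
for n, r in ratios.items():
    print(f"n = {n:2d}: expected bad ratio = {r:.3e}")
```

In this toy setting the ratio is governed by how the true errors are distributed over the set, not by how many classifiers the set contains: multiplying both groups by the same factor leaves every value unchanged.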