We investigate the generalization and optimization properties of shallow neural-network classifiers trained by gradient descent in the interpolating regime. Specifically, in a realizable scenario where model weights can achieve arbitrarily small training error $\epsilon$ at distance $g(\epsilon)$ from initialization, we show that gradient descent on $n$ training samples achieves training error $O(g(1/T)^2/T)$ and generalization error $O(g(1/T)^2/n)$ at iteration $T$, provided there are at least $m=\Omega(g(1/T)^4)$ hidden neurons. We then show that this realizable setting includes the special case in which the data are separable by the model's neural tangent kernel. In this case, for logistic-loss minimization, we prove that the training loss decays at a rate of $\tilde O(1/T)$ given a polylogarithmic number of neurons $m=\Omega(\log^4(T))$. Moreover, with $m=\Omega(\log^{4}(n))$ neurons and $T\approx n$ iterations, we bound the test loss by $\tilde{O}(1/n)$. Our results differ from existing generalization results based on the algorithmic-stability framework, which require polynomial width and yield suboptimal generalization rates. Central to our analysis is a new self-bounded weak-convexity property, which leads to a generalized local quasi-convexity property for sufficiently parameterized neural-network classifiers. Ultimately, despite the objective's non-convexity, this yields convergence and generalization-gap bounds that resemble those of the convex setting of linear logistic regression.
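As a rough consistency check, the rates for the neural-tangent-kernel-separable case can be read off from the general bounds under one assumption that the abstract does not state explicitly: that $g(\epsilon)$ grows like $\log(1/\epsilon)$ in this case, which is what the width requirement $m=\Omega(\log^4(T))$ suggests. Constants, including any dependence on a separation margin, are omitted in this sketch.

$$
\begin{aligned}
\text{training error at iteration } T &: \quad O\!\left(\frac{g(1/T)^2}{T}\right) = O\!\left(\frac{\log^2 T}{T}\right) = \tilde O\!\left(\frac{1}{T}\right), \\
\text{required width} &: \quad m = \Omega\!\left(g(1/T)^4\right) = \Omega\!\left(\log^4 T\right), \\
\text{generalization error with } T \approx n &: \quad O\!\left(\frac{g(1/n)^2}{n}\right) = O\!\left(\frac{\log^2 n}{n}\right) = \tilde O\!\left(\frac{1}{n}\right).
\end{aligned}
$$

With $T\approx n$, the training and generalization terms are both $\tilde O(1/n)$, which matches the stated $\tilde{O}(1/n)$ bound on the test loss.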