In this paper, we study the data-dependent convergence and generalization behavior of gradient methods for neural networks with smooth activations. Our first result is a novel bound on the excess risk of deep networks trained with the logistic loss, via an algorithmic stability analysis. Compared to previous works, our results improve upon the well-established Rademacher complexity-based bounds by addressing their shortcomings. Importantly, the bounds we derive in this paper are tighter, hold even for neural networks of small width, do not scale unfavorably with width, are algorithm-dependent, and consequently capture the role of initialization in the sample complexity of gradient descent for deep nets. Specialized to noiseless data separable with margin $\gamma$ by the neural tangent kernel (NTK) features of a network of width $\Omega(\mathrm{poly}(\log n))$, we show a test-error rate of $e^{O(L)}/(\gamma^{2} n)$, where $n$ is the training set size and $L$ denotes the number of hidden layers. This improves on the test-loss bounds of previous works while maintaining the poly-logarithmic width condition. We further investigate excess risk bounds for deep nets trained on noisy data, establishing that under a polynomial condition on the network width, gradient descent can achieve the optimal excess risk. Finally, we show that a large step size significantly improves upon the NTK regime's results in classifying the XOR distribution. In particular, we show for a one-hidden-layer neural network of constant width $m$ with quadratic activation and standard Gaussian initialization that mini-batch SGD with linear sample complexity and a large step size $\eta = m$ reaches perfect test accuracy after only $\lceil \log(d) \rceil$ iterations, where $d$ is the data dimension.
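To make the final setup concrete, below is a minimal sketch, not the paper's exact construction, of the XOR experiment described above: a one-hidden-layer network of constant width $m$ with quadratic activation and standard Gaussian initialization, trained by mini-batch SGD with step size $\eta = m$ for $\lceil \log(d) \rceil$ iterations. The particular XOR distribution ($y = x_1 x_2$ over $\{\pm 1\}^d$ inputs), the logistic loss, the batch size, and the constant in the linear sample size are illustrative assumptions and do not reproduce the paper's exact theorem conditions.

```python
import numpy as np

rng = np.random.default_rng(0)

d, m = 256, 8                        # data dimension, constant hidden width
n = 8 * d                            # linear-in-d sample size (constant factor assumed)
eta = float(m)                       # large step size eta = m
T = int(np.ceil(np.log(d)))          # number of SGD iterations, ceil(log d)
batch = n // T                       # assumed mini-batch size

# Assumed XOR distribution: x ~ Unif({+-1}^d), label y = x_1 * x_2.
def sample(k):
    X = rng.choice([-1.0, 1.0], size=(k, d))
    return X, X[:, 0] * X[:, 1]

X_train, y_train = sample(n)

# One hidden layer, quadratic activation, standard Gaussian initialization.
W = rng.standard_normal((m, d))
a = rng.standard_normal(m)

def forward(X):
    Z = X @ W.T                      # pre-activations, shape (k, m)
    return Z, (Z ** 2) @ a           # quadratic activation, then linear output

def sigmoid(z):
    # numerically stable logistic sigmoid
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

for t in range(T):
    idx = rng.choice(n, size=batch, replace=False)
    Xb, yb = X_train[idx], y_train[idx]
    Z, out = forward(Xb)
    # Assumed logistic loss l(y, f) = log(1 + exp(-y f)); dl/df = -y * sigmoid(-y f).
    g = -(yb * sigmoid(-yb * out)) / batch
    grad_a = (Z ** 2).T @ g                                  # df/da_j = (w_j . x)^2
    grad_W = 2.0 * ((g[:, None] * Z * a[None, :]).T @ Xb)    # df/dw_j = 2 a_j (w_j . x) x
    a -= eta * grad_a
    W -= eta * grad_W

X_test, y_test = sample(10_000)
_, pred = forward(X_test)
print("test accuracy:", np.mean(np.sign(pred) == y_test))
```

The hyperparameters above are placeholders for illustration; the paper's guarantee concerns the stated regime (constant width, step size $\eta = m$, linear sample complexity, $\lceil \log(d) \rceil$ iterations), not this particular run.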