In this paper, we study the generalization ability of wide residual networks on $\mathbb{S}^{d-1}$ with the ReLU activation function. We first show that, as the width $m\rightarrow\infty$, the residual network kernel (RNK) converges uniformly to the residual neural tangent kernel (RNTK). This uniform convergence further guarantees that the generalization error of the residual network converges to that of kernel regression with respect to the RNTK. As direct corollaries, we then show that $i)$ a wide residual network trained with early stopping can achieve the minimax rate, provided that the target regression function lies in the reproducing kernel Hilbert space (RKHS) associated with the RNTK; and $ii)$ a wide residual network cannot generalize well if it is trained until it overfits the data. Finally, we present experiments that reconcile the apparent contradiction between our theoretical results and the widely observed ``benign overfitting'' phenomenon.
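To make the contrast between corollaries $i)$ and $ii)$ concrete, the following is a sketch under assumptions that are not part of the statement above: the squared loss $\frac{1}{2n}\sum_{j=1}^{n}(f(x_j)-y_j)^2$, gradient flow started from $f_0=0$, and an invertible kernel matrix $K(X,X)$. The symbols $K$ (standing in for the RNTK), $X=(x_1,\dots,x_n)$, $y=(y_1,\dots,y_n)^{\top}$, $n$, and the training time $t$ are illustrative notation. Under these assumptions, kernel gradient flow has the closed form
\[
f_t(x) = K(x,X)\,K(X,X)^{-1}\bigl(I - e^{-tK(X,X)/n}\bigr)\,y .
\]
Early stopping corresponds to evaluating $f_t$ at a finite, suitably chosen $t$, whereas training until the data are overfitted corresponds to $t\to\infty$, where $f_t$ tends to the interpolant $K(x,X)\,K(X,X)^{-1}y$.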