In this paper we consider nonparametric regression by an over-parameterized two-layer neural network trained by gradient descent (GD) or its variants. We show that, if the neural network is trained with a novel Preconditioned Gradient Descent (PGD) with early stopping and the target function exhibits the spectral bias widely studied in the deep learning literature, the trained network attains a particularly sharp generalization bound with a minimax optimal rate of $\cO({1}/{n^{4\alpha/(4\alpha+1)}})$, which is sharper than the current standard rate of $\cO({1}/{n^{2\alpha/(2\alpha+1)}})$ with $2\alpha = d/(d-1)$, where the data are distributed uniformly on the unit sphere in $\RR^d$ and $n$ is the size of the training data. When the target function has no spectral bias, we prove that a neural network trained with regular GD with early stopping still enjoys the minimax optimal rate, and in this case our results do not require distributional assumptions, in contrast with the currently known results. Our results are built upon two significant technical contributions. First, uniform convergence to the neural tangent kernel (NTK) is established during training by PGD or GD, so that the neural network function at any step of GD or PGD admits a decomposition into a function in the reproducing kernel Hilbert space (RKHS) and an error function with a small $L^{\infty}$-norm. Second, local Rademacher complexity is employed to tightly bound the Rademacher complexity of the function class comprising all possible neural network functions obtained by GD or PGD. Our results also indicate that PGD can be another route to escaping the usual linear regime of the NTK and obtaining a sharper generalization bound, because PGD induces, during training, a kernel of lower complexity than the regular NTK induced by the network architecture trained by regular GD.
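For concreteness, a generic preconditioned gradient step and the decomposition referred to above can be sketched as follows; the preconditioner $P_t$, step size $\eta$, empirical loss $\widehat{L}_n$, and network function $f(\cdot;\theta)$ are illustrative placeholders, and the specific choices made in this paper are defined in the main text.
\[
  \theta_{t+1} \;=\; \theta_t \;-\; \eta\, P_t^{-1} \nabla_{\theta} \widehat{L}_n(\theta_t),
  \qquad
  \widehat{L}_n(\theta) \;=\; \frac{1}{2n}\sum_{i=1}^{n}\bigl(f(\mathbf{x}_i;\theta)-y_i\bigr)^2,
\]
and, at every step $t$ of GD or PGD, the trained network admits a decomposition of the form
\[
  f(\cdot;\theta_t) \;=\; f_{\mathcal{H},t} + \varepsilon_t,
  \qquad f_{\mathcal{H},t}\in\mathcal{H}_{\mathrm{NTK}},
  \qquad \|\varepsilon_t\|_{L^{\infty}} \text{ small}.
\]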