Recent progress has been made in understanding the statistical generalization performance of gradient descent methods for overparameterized neural networks within the neural tangent kernel (NTK) regime. However, most of the existing work on regression problems is limited to shallow network architectures, leaving a notable gap in the theory of deep neural networks. This paper addresses this gap by presenting a comprehensive generalization analysis for deep ReLU networks trained using gradient descent (GD) and stochastic gradient descent (SGD). Specifically, we establish the first known minimax-optimal rates of excess population risk for both GD and SGD with deep ReLU networks, under the assumption that the network width scales polynomially with respect to the network depth and training sample size. Our results demonstrate that with sufficient width, gradient descent methods for deep ReLU networks can achieve optimal generalization rates on par with kernel methods.