Understanding the generalization performance of over-parameterized neural networks has become a central topic in deep learning theory. While recent advances, particularly works under the Neural Tangent Kernel (NTK) regime, have shed light on the behavior of shallow architectures, the statistical generalization properties of deep neural networks (DNNs), especially in regression tasks, remain far less understood. In this paper, we make significant progress toward closing this gap by providing a comprehensive generalization analysis of DNNs trained using gradient-based methods. First, we establish, for the first time, a crucial connection between the learning dynamics of a DNN with smooth activation functions trained via gradient-based methods and those of kernel methods, showing that gradient-based methods on over-parameterized DNNs can fully inherit the favorable learning dynamics of their kernel counterparts. Building on this connection and the well-established optimality of kernel methods, we derive the first known minimax-optimal rates for the excess population risk of both gradient descent (GD) and stochastic gradient descent (SGD), under the assumption that network width scales polynomially with the sample size. Our results demonstrate that, with sufficient width, DNNs trained by GD or SGD can achieve generalization performance comparable to kernel-based methods.