A candidate explanation of the good empirical performance of deep neural networks is the implicit regularization effect of first order optimization methods. Inspired by this, we prove a convergence theorem for nonconvex composite optimization, and apply it to a general learning problem covering many machine learning applications, including supervised learning. We then present a deep multilayer perceptron model and prove that, when sufficiently wide, it $(i)$ leads to the convergence of gradient descent to a global optimum with a linear rate, $(ii)$ benefits from the implicit regularization effect of gradient descent, $(iii)$ is subject to novel bounds on the generalization error, $(iv)$ exhibits the lazy training phenomenon and $(v)$ enjoys learning rate transfer across different widths. The corresponding coefficients, such as the convergence rate, improve as width is further increased, and depend on the even order moments of the data generating distribution up to an order depending on the number of layers. The only non-mild assumption we make is the concentration of the smallest eigenvalue of the neural tangent kernel at initialization away from zero, which has been shown to hold for a number of less general models in contemporary works. We present empirical evidence supporting this assumption as well as our theoretical claims.
翻译:关于深度神经网络良好实证表现的一个候选解释是一阶优化方法的隐式正则化效应。受此启发,我们证明了一个非凸复合优化问题的收敛定理,并将其应用于涵盖众多机器学习应用(包括监督学习)的一般性学习问题。随后,我们提出一个深度多层感知机模型,并证明当网络宽度足够大时,该模型(i)能以线性速率实现梯度下降收敛到全局最优解,(ii)受益于梯度下降的隐式正则化效应,(iii)受限于全新的泛化误差上界,(iv)展现懒训练现象,且(v)在不同宽度下实现学习率迁移。相应的系数(如收敛速率)随宽度增加而改善,其取值依赖于数据生成分布直至与层数相关阶数的偶数阶矩。我们所作的唯一非温和假设是神经正切核在初始化时最小特征值远离零,这在当代工作中已被多个较一般性模型所验证。我们同时提供了支持该假设及理论结论的实证证据。