Recently, significant progress has been made in understanding the generalization of neural networks (NNs) trained by gradient descent (GD) using the algorithmic stability approach. However, most existing work focuses on one-hidden-layer NNs and does not address the impact of different network scaling parameters. In this paper, we greatly extend the previous work \cite{lei2022stability,richards2021stability} by conducting a comprehensive stability and generalization analysis of GD for multi-layer NNs. For two-layer NNs, our results hold under general network scaling parameters, relaxing previous conditions. For three-layer NNs, our technical contribution lies in establishing a nearly co-coercive property in this setting, using a novel induction strategy that thoroughly explores the effects of over-parameterization. As a direct application of our general findings, we derive an excess risk rate of $O(1/\sqrt{n})$ for GD on both two-layer and three-layer NNs. This sheds light on sufficient or necessary conditions for under-parameterized and over-parameterized NNs trained by GD to attain the desired risk rate of $O(1/\sqrt{n})$. Moreover, we show that as the scaling parameter increases or the network complexity decreases, less over-parameterization is required for GD to achieve the desired error rates. Additionally, under a low-noise condition, we obtain a fast risk rate of $O(1/n)$ for GD on both two-layer and three-layer NNs.