Recently, significant progress has been made in understanding the generalization of neural networks (NNs) trained by gradient descent (GD) using the algorithmic stability approach. However, most existing work focuses on one-hidden-layer NNs and does not address the impact of different network scaling parameters. In this paper, we substantially extend the previous work \cite{lei2022stability,richards2021stability} by conducting a comprehensive stability and generalization analysis of GD for multi-layer NNs. For two-layer NNs, our results are established under general network scaling parameters, relaxing previous conditions. For three-layer NNs, our technical contribution lies in demonstrating that they satisfy a nearly co-coercive property, using a novel induction strategy that thoroughly explores the effects of over-parameterization. As a direct application of our general findings, we derive an excess risk rate of $O(1/\sqrt{n})$ for GD on both two-layer and three-layer NNs. This sheds light on sufficient or necessary conditions for under-parameterized and over-parameterized NNs trained by GD to attain the desired risk rate of $O(1/\sqrt{n})$. Moreover, we demonstrate that as the scaling parameter increases or the network complexity decreases, less over-parameterization is required for GD to achieve the desired error rates. Additionally, under a low-noise condition, we obtain a fast risk rate of $O(1/n)$ for GD on both two-layer and three-layer NNs.