We demonstrate that, in empirical risk minimization (ERM) where a two-layer neural network with Lipschitz activation functions is trained by standard gradient descent (GD) on the mean-squared-error loss, applying an eventual decay to the learning rate (LR) ensures that the resulting network exhibits a high degree of Lipschitz regularity, that is, a small Lipschitz constant. Moreover, we show that this decay does not hinder the convergence rate of the empirical risk, now measured with the Huber loss, toward a critical point of this non-convex objective. From these findings, we derive generalization bounds for two-layer neural networks trained with GD and a decaying LR that depend only sub-linearly on the number of trainable parameters, suggesting that the statistical behaviour of these networks is independent of overparameterization. We validate our theoretical results with a series of toy numerical experiments, where, surprisingly, we observe that networks trained with constant-step-size GD exhibit learning and regularity properties similar to those trained with a decaying LR. This suggests that neural networks trained with standard GD may already be highly regular learners.
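To make the setting concrete, the following is a minimal, purely illustrative sketch of the kind of training procedure described above: full-batch GD on a two-layer network with a 1-Lipschitz activation, the mean-squared-error loss, and an eventual LR decay, followed by a crude operator-norm upper bound on the trained network's Lipschitz constant. The specific decay schedule, the switch point T0, and all problem sizes are assumptions made for illustration, not the choices analyzed in the paper.

```python
import numpy as np

# Illustrative sketch (not the paper's exact setup): full-batch GD on a
# two-layer network f(x) = W2 @ tanh(W1 @ x), MSE loss, with a constant
# step size up to a hypothetical switch point T0 and an eventual decay
# eta_t = eta0 / sqrt(t - T0) afterwards.

rng = np.random.default_rng(0)
n, d, m = 200, 5, 512                     # samples, input dim, hidden width
X = rng.normal(size=(n, d))
y = np.sin(X @ rng.normal(size=d))        # toy regression target

W1 = rng.normal(size=(m, d)) / np.sqrt(d)
W2 = rng.normal(size=(1, m)) / np.sqrt(m)

eta0, T0, T = 0.05, 500, 2000             # base LR, decay onset, total steps

def forward(X, W1, W2):
    H = np.tanh(X @ W1.T)                 # hidden activations; tanh is 1-Lipschitz
    return H, H @ W2.T                    # predictions of shape (n, 1)

for t in range(1, T + 1):
    H, pred = forward(X, W1, W2)
    err = pred - y[:, None]               # residuals for the MSE loss
    # Gradients of (1/n) * sum_i ||f(x_i) - y_i||^2 w.r.t. W2 and W1
    gW2 = 2.0 / n * err.T @ H
    gW1 = 2.0 / n * ((err @ W2) * (1.0 - H**2)).T @ X
    # Constant step size before T0, then an eventual 1/sqrt decay (assumed schedule)
    eta = eta0 if t <= T0 else eta0 / np.sqrt(t - T0)
    W2 -= eta * gW2
    W1 -= eta * gW1

# Crude upper bound on the network's Lipschitz constant: product of the
# layer operator norms (activation is 1-Lipschitz)
lip_bound = np.linalg.norm(W2, 2) * np.linalg.norm(W1, 2)
mse = np.mean((forward(X, W1, W2)[1] - y[:, None]) ** 2)
print(f"train MSE: {mse:.4f}, Lipschitz upper bound: {lip_bound:.2f}")
```

The printed operator-norm product only upper-bounds the true Lipschitz constant; it serves here as a simple proxy for monitoring the regularity of the trained network under the two step-size regimes.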