Convergence rates for gradient descent in the training of overparameterized artificial neural networks with piecewise affine activation

In recent years, artificial neural networks have developed into a powerful tool for addressing a multitude of problems for which classical solution approaches reach their limits. However, it is still unclear why gradient descent optimization algorithms with random initialization, such as the well-known batch gradient descent, are able to achieve zero training loss in many situations, even though the objective function is non-convex and non-smooth. One of the most promising approaches to solving this issue in the field of supervised learning is the analysis of gradient descent optimization in the so-called overparameterized regime. In this article, we provide a further contribution to this area of research by considering overparameterized fully connected shallow artificial neural networks with piecewise affine activation, such as the rectified linear unit activation. Specifically, given that the activation function is not affine and the training input data are pairwise distinct, we show that, with high probability, the mean squared error of such a randomly initialized artificial neural network optimized via batch gradient descent converges to zero at a linear convergence rate as long as the width of the artificial neural network is sufficiently large and the learning rate is sufficiently small.
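The setting described in the abstract can be illustrated with a small experiment. The following is a minimal sketch, not the construction analyzed in the article: the four training points, the targets, the standard normal initialization, the 1/sqrt(width) output scaling, the choice of 0 as the ReLU derivative at the kink, and all hyperparameters (`width`, `steps`, `lr`, `seed`) are assumptions made for illustration only. It trains a wide shallow ReLU network with full-batch gradient descent on the mean squared error and records the loss at every step.

```python
import math
import random


def relu(z):
    return z if z > 0.0 else 0.0


def train(width=200, steps=500, lr=0.1, seed=0):
    """Full-batch gradient descent on a shallow ReLU network (illustrative only)."""
    rng = random.Random(seed)

    # Assumed toy data: pairwise distinct inputs in R^2 with arbitrary targets.
    xs = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
    ys = [1.0, -1.0, 0.5, -0.5]
    n = len(xs)

    # Assumed parameterization: f(x) = (1/sqrt(width)) * sum_j v_j * relu(w_j . x + b_j)
    scale = 1.0 / math.sqrt(width)
    w = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(width)]
    b = [rng.gauss(0.0, 1.0) for _ in range(width)]
    v = [rng.gauss(0.0, 1.0) for _ in range(width)]

    losses = []
    for _ in range(steps + 1):
        # Forward pass over the whole training set (one "batch").
        pre = [[w[j][0] * x[0] + w[j][1] * x[1] + b[j] for j in range(width)]
               for x in xs]
        out = [scale * sum(v[j] * relu(pre[i][j]) for j in range(width))
               for i in range(n)]
        res = [out[i] - ys[i] for i in range(n)]
        losses.append(sum(r * r for r in res) / n)  # mean squared error

        # Gradient step on all parameters. The ReLU derivative is taken
        # to be 0 at the kink (an assumed convention).
        for j in range(width):
            gw0 = gw1 = gb = gv = 0.0
            for i in range(n):
                c = 2.0 * res[i] * scale / n
                gv += c * relu(pre[i][j])
                if pre[i][j] > 0.0:
                    gw0 += c * v[j] * xs[i][0]
                    gw1 += c * v[j] * xs[i][1]
                    gb += c * v[j]
            w[j][0] -= lr * gw0
            w[j][1] -= lr * gw1
            b[j] -= lr * gb
            v[j] -= lr * gv
    return losses


if __name__ == "__main__":
    history = train()
    for step in (0, 100, 200, 300, 400, 500):
        print(step, history[step])
```

Plotting the recorded losses on a logarithmic scale is one way to check whether the decay looks roughly geometric for a given width and learning rate. A single run like this only illustrates the claim for one random seed and says nothing about the probability bound or the required width.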

