In recent years, artificial neural networks have developed into a powerful tool for addressing a multitude of problems for which classical solution approaches reach their limits. However, it is still unclear why gradient descent optimization algorithms with random initialization, such as the well-known batch gradient descent, are able to achieve zero training loss in many situations, even though the objective function is non-convex and non-smooth. One of the most promising approaches to solving this issue in the field of supervised learning is the analysis of gradient descent optimization in the so-called overparameterized regime. In this article, we provide a further contribution to this area of research by considering overparameterized fully connected shallow artificial neural networks with piecewise affine activation, such as the rectified linear unit activation. Specifically, given that the activation function is not affine and the training input data are pairwise distinct, we show that, with high probability, the mean squared error of such a randomly initialized artificial neural network optimized via batch gradient descent converges to zero at a linear convergence rate as long as the width of the artificial neural network is sufficiently large and the learning rate is sufficiently small.
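The setting described above can be illustrated with a minimal NumPy sketch: a wide shallow ReLU network is initialized at random and trained by full-batch gradient descent on the mean squared error over pairwise distinct inputs. This is an illustration under stated assumptions, not the construction used in the article. The function name `train_shallow_relu`, the standard normal initialization with the outer weights scaled by 1/sqrt(width), the heuristic learning rate `0.5 / width`, and the choice of unit-norm random inputs are all hypothetical choices made for this example.

```python
import numpy as np


def train_shallow_relu(X, y, width=2000, steps=1000, lr=None, seed=0):
    """Full-batch gradient descent on the MSE of a shallow ReLU network.

    Model: f(x) = sum_j v[j] * relu(W[j] @ x + b[j]) + c.
    Returns the MSE before each step plus the final MSE (length steps + 1).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # Assumed initialization (not taken from the article): standard normal
    # inner parameters, outer weights scaled by 1 / sqrt(width).
    W = rng.standard_normal((width, d))
    b = rng.standard_normal(width)
    v = rng.standard_normal(width) / np.sqrt(width)
    c = 0.0

    # Assumed heuristic: a step size shrinking like 1 / width, intended for
    # inputs of roughly unit norm.
    if lr is None:
        lr = 0.5 / width

    losses = np.empty(steps + 1)
    for t in range(steps + 1):
        Z = X @ W.T + b              # pre-activations, shape (n, width)
        H = np.maximum(Z, 0.0)       # ReLU features
        r = H @ v + c - y            # residuals, shape (n,)
        losses[t] = np.mean(r ** 2)
        if t == steps:
            break

        # Gradients of the MSE; the ReLU derivative at 0 is taken to be 0.
        g_pred = 2.0 * r / n
        G = g_pred[:, None] * v[None, :] * (Z > 0)   # dL/dZ, shape (n, width)
        g_W = G.T @ X
        g_b = G.sum(axis=0)
        g_v = H.T @ g_pred
        g_c = g_pred.sum()

        # Simultaneous update of all parameters.
        W -= lr * g_W
        b -= lr * g_b
        v -= lr * g_v
        c -= lr * g_c
    return losses


# Pairwise distinct training inputs: random unit vectors in R^64.
data_rng = np.random.default_rng(123)
X = data_rng.standard_normal((8, 64))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = data_rng.uniform(-1.0, 1.0, size=8)

losses = train_shallow_relu(X, y)
print("initial MSE:", losses[0])
print("final MSE:  ", losses[-1])
```

If the loss decays at a linear rate, `np.log(losses)` plotted against the step index should look roughly like a straight line with negative slope. How steep that line is in practice depends on the width, the learning rate, and how well separated the inputs are; the values above are only one convenient choice.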