We study the convergence of gradient descent (GD) and stochastic gradient descent (SGD) when training mildly parameterized neural networks from random initialization. For a broad range of models and loss functions, including the most commonly used square loss and cross-entropy loss, we prove an ``early stage convergence'' result: the loss decreases by a significant amount in the early stage of training, and this decrease is fast. Furthermore, for exponential-type loss functions, and under some assumptions on the training data, we show global convergence of GD. Instead of relying on extreme over-parameterization, our analysis is based on a microscopic study of the activation patterns of the neurons, which helps us derive stronger lower bounds for the gradient. The results on activation patterns, which we call ``neuron partition'', help build intuition for the training dynamics of neural networks and may be of independent interest.