The theory of training deep networks has become a central question of modern machine learning and has inspired many practical advancements. In particular, the gradient descent (GD) optimization algorithm has been extensively studied in recent years. A key assumption about GD appears in several recent works: the \emph{GD map is non-singular} -- it preserves sets of measure zero under preimages. Crucially, this assumption has been used to prove that GD avoids saddle points and maxima, and to establish the existence of a computable quantity that determines convergence to global minima (both for GD and stochastic GD). However, the current literature either assumes the non-singularity of the GD map outright or imposes restrictive assumptions, such as Lipschitz smoothness of the loss (which fails, for example, for deep ReLU networks with the cross-entropy loss), and restricts the analysis to GD with small step-sizes. In this paper, we study the neural network map as a function on the space of weights and biases. We prove, for the first time, the non-singularity of the GD map on the loss landscape of realistic neural network architectures (with fully connected, convolutional, or softmax attention layers) and piecewise analytic activations (including sigmoid, ReLU, leaky ReLU, etc.) for almost all step-sizes. Our work significantly extends the existing results on the convergence of GD and SGD by guaranteeing that they apply to practical neural network settings, and it has the potential to unlock further exploration of learning dynamics.
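As a minimal sketch of the property in question (our notation, not necessarily the paper's): for a loss $L$ on parameters $\theta$ and a step-size $\eta$, the GD map and its non-singularity read
\[
  g_\eta(\theta) \;=\; \theta - \eta\,\nabla L(\theta),
  \qquad
  \lambda(A) = 0 \;\Longrightarrow\; \lambda\bigl(g_\eta^{-1}(A)\bigr) = 0
  \quad \text{for every measurable set } A,
\]
where $\lambda$ denotes the Lebesgue measure on the space of weights and biases; that is, preimages of null sets under $g_\eta$ remain null.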