In this paper, we examine the time it takes for stochastic gradient descent (SGD) to reach the global minimum of a general, non-convex loss function. We approach this question through the lens of randomly perturbed dynamical systems and large deviations theory, and we provide a tight characterization of the global convergence time of SGD via matching upper and lower bounds. These bounds are dominated by the most "costly" set of obstacles that the algorithm may need to overcome to reach a global minimizer from a given initialization, thereby coupling the global geometry of the underlying loss landscape with the statistics of the noise entering the process. Finally, motivated by applications to the training of deep neural networks, we also provide a series of refinements and extensions of our analysis for loss functions with shallow local minima.
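For concreteness, the "randomly perturbed dynamical system" viewpoint refers to the standard way of writing the SGD recursion as a noisy discretization of gradient flow; the sketch below uses generic illustrative notation ($f$, $\eta$, $\xi_k$) rather than the paper's own symbols.

```latex
% Standard SGD recursion viewed as a randomly perturbed gradient system.
% f   : non-convex loss function
% \eta: step size (learning rate)
% \xi_k: zero-mean stochastic gradient noise at iteration k
x_{k+1}
  = x_k - \eta \bigl( \nabla f(x_k) + \xi_k \bigr),
\qquad
\mathbb{E}\bigl[ \xi_k \mid x_k \bigr] = 0 .
```

Under this reading, the deterministic drift $-\nabla f$ pulls the iterates toward critical points, while the noise $\xi_k$ is what allows the process to escape local minima and saddle regions, which is the mechanism the large deviations analysis quantifies.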