In this paper, we examine the long-run distribution of stochastic gradient descent (SGD) in general, non-convex problems. Specifically, we seek to understand which regions of the problem's state space are more likely to be visited by SGD, and by how much. Using an approach based on the theory of large deviations and randomly perturbed dynamical systems, we show that the long-run distribution of SGD resembles the Boltzmann-Gibbs distribution of equilibrium thermodynamics, with temperature equal to the method's step-size and energy levels determined by the problem's objective and the statistics of the noise. In particular, we show that, in the long run, (a) the problem's critical region is visited exponentially more often than any non-critical region; (b) the iterates of SGD are exponentially concentrated around the problem's minimum energy state (which does not always coincide with the global minimum of the objective); (c) all other connected components of critical points are visited with a frequency that is exponentially proportional to their energy level; and, finally, (d) any component of local maximizers or saddle points is "dominated" by a component of local minimizers that is visited exponentially more often.
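To make the Boltzmann-Gibbs analogy concrete, the schematic below writes out the shape such a statement would take. The notation is an assumption introduced here for illustration and is not taken from the paper: $\eta$ denotes the step-size, $\mathcal{K}_i$ a connected component of critical points, $E_i$ its energy level, and $\mu_\eta$ the long-run distribution of the iterates. The relations are meant only up to lower-order corrections, which the abstract does not specify.

```latex
% Schematic only: symbols \eta, \mathcal{K}_i, E_i, \mu_\eta are assumed notation.
% Long-run weight of a critical component, with the step-size \eta
% playing the role of temperature:
\[
  \mu_\eta(\mathcal{K}_i) \;\propto\; \exp\!\left(-\frac{E_i}{\eta}\right)
\]
% Relative visit frequency of two components \mathcal{K}_i and \mathcal{K}_j:
\[
  \frac{\mu_\eta(\mathcal{K}_i)}{\mu_\eta(\mathcal{K}_j)}
  \;\approx\; \exp\!\left(-\frac{E_i - E_j}{\eta}\right)
\]
```

Read this way, a fixed energy gap $E_i - E_j > 0$ translates into an exponentially small relative frequency as $\eta$ shrinks, which is the sense in which the iterates concentrate around the minimum energy state in item (b).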