Characterizing and understanding the stability of Stochastic Gradient Descent (SGD) remains an open problem in deep learning. A common approach quantifies stability through the convergence of statistical moments of the parameters, especially the variance. We revisit the definition of stability for SGD and propose using the $\textit{convergence in probability}$ condition to define the $\textit{probabilistic stability}$ of SGD. Probabilistic stability sheds light on a fundamental question in deep learning theory: how SGD selects a meaningful solution for a neural network from an enormous number of possible solutions, many of which may severely overfit. We show that only through the lens of probabilistic stability does SGD exhibit rich and practically relevant phases of learning: complete loss of stability, incorrect learning where the model captures an incorrect correlation in the data, convergence to low-rank saddles, and correct learning where the model captures the correct correlation. The boundaries between these phases are precisely quantified by the Lyapunov exponents of the dynamics. The resulting phase diagrams imply that SGD prefers low-rank saddles in a neural network when the underlying gradient is noisy, thereby influencing the learning performance.
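To make the contrast between moment-based stability and probabilistic stability concrete, the following is a minimal sketch on a toy problem that is an assumption of this example and is not taken from the paper: a one-dimensional per-sample loss $x^2 w^2 / 2$ with $x \sim \mathcal{N}(0, 1)$, for which SGD with learning rate `lr` reads $w_{t+1} = (1 - \mathrm{lr}\, x_t^2)\, w_t$. The function names, the learning rate, and the tolerance are all hypothetical choices made for illustration. For this toy model, the second moment of $w_t$ is multiplied by $E[(1 - \mathrm{lr}\, x^2)^2]$ at each step, while convergence in probability to $w = 0$ is governed by the sign of the Lyapunov exponent $E[\log |1 - \mathrm{lr}\, x^2|]$.

```python
import numpy as np


def second_moment_factor(lr):
    """Per-step growth factor of E[w^2] in the toy model.

    Equals E[(1 - lr * x^2)^2] for x ~ N(0, 1), using E[x^2] = 1 and E[x^4] = 3.
    """
    return 1.0 - 2.0 * lr + 3.0 * lr**2


def lyapunov_exponent(lr, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[log|1 - lr * x^2|] for x ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)
    return float(np.mean(np.log(np.abs(1.0 - lr * x**2))))


def fraction_converged(lr, n_runs=2_000, n_steps=200, tol=1e-3, seed=1):
    """Fraction of independent SGD runs (w_0 = 1) that end with |w_T| < tol.

    The product of per-step multipliers is accumulated in log space so that
    very small or very large values of |w_T| do not underflow or overflow.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_runs, n_steps))
    log_abs_w = np.sum(np.log(np.abs(1.0 - lr * x**2)), axis=1)
    return float(np.mean(log_abs_w < np.log(tol)))


lr = 1.0  # hypothetical learning rate chosen for illustration
factor = second_moment_factor(lr)
lam = lyapunov_exponent(lr)
frac = fraction_converged(lr)

print(f"second-moment growth factor per step: {factor:.3f}")
print(f"estimated Lyapunov exponent:          {lam:.3f}")
print(f"fraction of runs with |w_T| < 1e-3:   {frac:.3f}")
```

With `lr = 1.0` the second-moment factor is exactly 2, so the variance of $w_t$ diverges and a moment-based criterion would label the dynamics unstable. The estimated Lyapunov exponent should nonetheless come out negative, and nearly all simulated runs should end close to $w = 0$, which is the behavior that a convergence-in-probability criterion is designed to capture. This one-dimensional sketch only illustrates the distinction between the two notions of stability; it does not reproduce the phase diagrams described above.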