Is Stochastic Gradient Descent Effective? A PDE Perspective on Machine Learning processes

In this paper we analyze the behaviour of stochastic gradient descent (SGD), a widely used method in supervised learning for optimizing neural network weights by minimizing non-convex loss functions. Since the pioneering work of E, Li and Tai (2017), the underlying structure of such processes can be understood via parabolic PDEs of Fokker-Planck type, which are at the core of our analysis. Although Fokker-Planck equations have a long history and an extensive literature, almost nothing is known when the potential is non-convex or when the diffusion matrix is degenerate, and this is the main difficulty that we face in our analysis. We identify two different regimes. In the initial phase of SGD, the loss function drives the weights to concentrate around the nearest local minimum; we refer to this phase as the drift regime and provide quantitative estimates on this concentration phenomenon. Next, we introduce the diffusion regime, where stochastic fluctuations help the learning process escape suboptimal local minima. We analyze the Mean Exit Time (MET) and prove upper and lower bounds on it. Finally, we address the asymptotic convergence of SGD for a non-convex cost function and a degenerate diffusion matrix, a setting in which the standard approaches do not apply and new techniques are required. For this purpose, we exploit two different methods: duality and entropy methods. We provide new results about the dynamics and effectiveness of SGD, offering a deep connection between stochastic optimization and PDE theory, and some answers and insights to basic questions in Machine Learning processes: How long does SGD take to escape from a bad minimum? Do neural network parameters converge using SGD? How do parameters evolve in the first stage of training with SGD?
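To make the two regimes concrete, the following is a minimal numerical sketch and not the construction analyzed in the paper. It assumes a one-dimensional double-well loss f(x) = (x^2 - 1)^2 / 4, a constant noise level `sigma`, and an SGD-like update read as an Euler-Maruyama step of dX = -f'(X) dt + sqrt(eta) * sigma dW with time step eta. The names `grad_f`, `run`, `exit_time` and `mean_exit_time` are hypothetical and chosen only for illustration.

```python
import math
import random


def grad_f(x):
    # Gradient of the assumed double-well loss f(x) = (x**2 - 1)**2 / 4,
    # which has local minima at x = -1 and x = +1 and a barrier at x = 0.
    return x * (x * x - 1.0)


def step(x, eta, sigma, rng):
    # One SGD-like update: gradient step plus Gaussian noise of size eta * sigma.
    # This equals an Euler-Maruyama step of dX = -f'(X) dt + sqrt(eta)*sigma dW
    # with dt = eta, since sqrt(eta) * sigma * sqrt(dt) = eta * sigma.
    return x - eta * grad_f(x) + eta * sigma * rng.gauss(0.0, 1.0)


def run(x0, eta, sigma, n_steps, rng=random):
    # Iterate the update n_steps times and return the final iterate.
    x = x0
    for _ in range(n_steps):
        x = step(x, eta, sigma, rng)
    return x


def exit_time(x0, eta, sigma, barrier=0.0, max_steps=200_000, rng=random):
    # Elapsed time (steps * eta) until the iterate first reaches `barrier`;
    # returns math.inf if that does not happen within max_steps.
    x = x0
    for k in range(max_steps):
        if x >= barrier:
            return k * eta
        x = step(x, eta, sigma, rng)
    return math.inf


def mean_exit_time(n_runs, eta, sigma, seed=0):
    # Monte Carlo estimate of the mean exit time from the well around x = -1,
    # averaged over the runs that actually exit.
    rng = random.Random(seed)
    times = [exit_time(-1.0, eta, sigma, rng=rng) for _ in range(n_runs)]
    finite = [t for t in times if math.isfinite(t)]
    return sum(finite) / len(finite) if finite else math.inf


if __name__ == "__main__":
    # Drift regime: without noise the iterate settles at the nearest minimum.
    print("noise-free limit:", run(-0.3, 0.1, 0.0, 1000))
    # Diffusion regime: noise lets the iterate cross the barrier at x = 0.
    for sigma in (1.5, 4.0):
        print("sigma =", sigma, "estimated MET:", mean_exit_time(100, 0.1, sigma))
```

In this toy setting, the noise-free run illustrates concentration around the nearest local minimum, while the Monte Carlo estimate gives a rough empirical counterpart to the Mean Exit Time; it is not a substitute for the upper and lower bounds proved in the paper.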
