Stochastic gradient descent (SGD) is the algorithm we use to train neural networks. However, how SGD navigates the highly nonlinear and degenerate loss landscape of a neural network remains poorly understood. In this work, we prove that the minibatch noise of SGD regularizes the solution towards a balanced solution whenever the loss function contains a rescaling symmetry. Because the difference between a simple diffusion process and SGD dynamics is most significant when symmetries are present, our theory implies that loss function symmetries constitute an essential probe of how SGD works. We then apply this result to derive the stationary distribution of stochastic gradient flow for a diagonal linear network with arbitrary depth and width. The stationary distribution exhibits complicated nonlinear phenomena such as phase transitions, broken ergodicity, and fluctuation inversion. These phenomena are shown to exist uniquely in deep networks, implying a fundamental difference between deep and shallow models.
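The balancing effect can be illustrated on a toy problem. The sketch below is not the derivation or the experimental setup of this work; it is a minimal example under our own assumptions: a two-parameter model y ≈ u·w·x with squared loss, synthetic Gaussian data with noisy labels, and hypothetical values for the learning rate, step count, and initialization. The loss is invariant under the rescaling (u, w) → (λu, w/λ). For the update u ← u − η·g·w, w ← w − η·g·u, where g is the minibatch gradient with respect to the product u·w, one can check by direct expansion that u² − w² is multiplied by (1 − η²g²) at every step. With full-batch gradient descent g vanishes at convergence, so the imbalance is almost conserved; with minibatch noise g keeps fluctuating, so the imbalance should keep shrinking.

```python
import numpy as np

def train(batch_size, steps=100_000, lr=0.01, seed=0):
    """Fit y ~ u*w*x by (minibatch) gradient descent and return (u, w).

    batch_size=None means full-batch gradient descent.
    All hyperparameters here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n = 256
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(size=n)  # noisy labels keep the gradient noise alive
    u, w = 2.0, 0.5                   # deliberately unbalanced start: u^2 - w^2 = 3.75
    for _ in range(steps):
        if batch_size is None:
            xb, yb = x, y
        else:
            idx = rng.integers(0, n, size=batch_size)
            xb, yb = x[idx], y[idx]
        # Gradient of the loss 0.5 * mean((u*w*x - y)^2) w.r.t. the product u*w.
        g = float(np.mean((u * w * xb - yb) * xb))
        # Simultaneous update: d(loss)/du = g*w and d(loss)/dw = g*u.
        u, w = u - lr * g * w, w - lr * g * u
    return u, w

initial_gap = 2.0**2 - 0.5**2

u_sgd, w_sgd = train(batch_size=1)     # minibatch SGD
u_gd, w_gd = train(batch_size=None)    # full-batch gradient descent

gap_sgd = u_sgd**2 - w_sgd**2
gap_gd = u_gd**2 - w_gd**2

print(f"initial    u^2 - w^2 = {initial_gap:.4f}")
print(f"SGD        u^2 - w^2 = {gap_sgd:.6f}")
print(f"full-batch u^2 - w^2 = {gap_gd:.4f}")
```

Both runs use the same synthetic dataset (same seed), so the only difference between them is the presence of minibatch noise. The batch size of 1 is chosen to make the noise, and hence the balancing, as strong as possible in this toy setting.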