We study the limiting dynamics of a large class of noisy gradient descent systems in the overparameterized regime. In this regime the set of global minimizers of the loss is large, and when initialized in a neighbourhood of this zero-loss set, a noisy gradient descent algorithm slowly evolves along this set. In some cases this slow evolution has been related to better generalisation properties. We characterize this evolution for a broad class of noisy gradient descent systems in the limit of small step size. Our results show that the structure of the noise affects not just the form of the limiting process, but also the time scale at which the evolution takes place. We apply the theory to Dropout, label noise and classical SGD (minibatching) noise, and show that these evolve on two different time scales. Classical SGD even yields a trivial evolution on both time scales, implying that additional noise is required for regularization. The results are inspired by the training of neural networks, but the theorems apply to noisy gradient descent on any loss that has a non-trivial zero-loss set.
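As an informal illustration of slow evolution along a zero-loss set, the following minimal sketch runs gradient descent with label noise on a toy loss. Everything in it is an assumption made for illustration and is not taken from the paper: the loss 0.5 * (u*v - y)^2 with clean label y = 1 (so the zero-loss set is the hyperbola u*v = 1), the Gaussian label perturbation, and the names `grad`, `run`, `lr` and `noise_std`. For this particular loss, one update step with residual r = u*v - y satisfies u'^2 - v'^2 = (u^2 - v^2)(1 - lr^2 r^2), so without noise an iterate started on the zero-loss set does not move, while with label noise the imbalance u^2 - v^2 shrinks slowly, at a rate of order lr^2 times the noise variance per step.

```python
import random


def grad(u, v, y):
    """Gradient of the toy loss 0.5 * (u*v - y)**2 with respect to (u, v)."""
    r = u * v - y  # residual against the (possibly noisy) label
    return r * v, r * u


def run(steps, lr, noise_std, seed=0, u0=2.0, v0=0.5):
    """Gradient descent with label noise, started on the zero-loss set u*v = 1.

    Hypothetical setup: the clean label is 1 and each step perturbs it with
    independent Gaussian noise of standard deviation noise_std.
    """
    rng = random.Random(seed)
    u, v = u0, v0
    for _ in range(steps):
        y = 1.0 + noise_std * rng.gauss(0.0, 1.0)  # noisy label for this step
        gu, gv = grad(u, v, y)
        u, v = u - lr * gu, v - lr * gv  # simultaneous update of both parameters
    return u, v


if __name__ == "__main__":
    for noise_std in (0.0, 0.5):
        u, v = run(steps=5000, lr=0.1, noise_std=noise_std)
        print(f"noise_std={noise_std}: u={u:.4f}, v={v:.4f}, "
              f"u*v={u * v:.4f}, u^2-v^2={u * u - v * v:.6f}")
```

With `noise_std=0.0` the iterate stays at its starting point, since the gradient vanishes on the zero-loss set. With label noise it drifts along the hyperbola toward the balanced point where u^2 = v^2, while u*v keeps fluctuating around 1. This only mirrors the qualitative picture in the abstract; it makes no claim about the limiting processes or time scales derived there.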