Understanding when the noise in stochastic gradient descent (SGD) affects the generalization of deep neural networks remains a challenge, complicated by the fact that networks can operate in distinct training regimes. Here we study how the magnitude $T$ of this noise affects performance as the size of the training set $P$ and the scale of initialization $\alpha$ are varied. For gradient descent, $\alpha$ is a key parameter that controls whether the network is `lazy' ($\alpha\gg 1$) or instead learns features ($\alpha\ll 1$). For classification of MNIST and CIFAR10 images, our central results are as follows. (i) We obtain phase diagrams for performance in the $(\alpha,T)$ plane. They show that SGD noise can be detrimental or instead useful depending on the training regime. Moreover, although increasing $T$ and decreasing $\alpha$ both allow the net to escape the lazy regime, these changes can have opposite effects on performance. (ii) Most importantly, we find that key dynamical quantities (including the total variation of the weights during training) depend on both $T$ and $P$ as power laws, and that the characteristic temperature $T_c$, at which SGD noise starts affecting performance, is a power law of $P$. These observations indicate that a key effect of SGD noise occurs late in training, by affecting the stopping process whereby all data are fitted. We argue that, due to SGD noise, nets must develop a stronger `signal', i.e. larger informative weights, to fit the data, leading to a longer training time. The same effect occurs for a larger training set size $P$. We confirm this view in the perceptron model, where signal and noise can be precisely measured. Interestingly, the exponents characterizing the effect of SGD depend on the density of data near the decision boundary, as we explain.