Stochastic gradient descent (SGD) has become a cornerstone of neural network optimization, yet the noise introduced by SGD is often assumed to be uncorrelated in time, despite the ubiquity of epoch-based training. In this work, we challenge this assumption and investigate the effects of epoch-based noise correlations on the stationary distribution of discrete-time SGD with momentum, restricted to a quadratic loss. Our main contributions are twofold: first, we calculate the exact autocorrelation of the noise for epoch-based training under the assumption that the noise is independent of small fluctuations in the weight vector, and we find that SGD noise is anti-correlated in time. Second, we explore the influence of these anti-correlations on SGD dynamics. We find that for directions with curvature greater than a hyperparameter-dependent crossover value, the results for uncorrelated noise are recovered. For relatively flat directions, however, the weight variance is significantly reduced, and our variance prediction yields considerably smaller loss fluctuations than the constant-weight-variance assumption.
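As a minimal numerical sketch of both contributions (not taken from the paper; the sizes N and B, the hyperparameters lr and mom, and the toy per-sample loss (h/2)(w - x_i)^2 are all illustrative assumptions), the following Python snippet first estimates the within-epoch noise autocorrelation at a fixed weight vector, where shuffling without replacement forces the minibatch noise to sum to zero over each epoch and hence to be anti-correlated, with value about -1/(M-1) at nonzero lags for M steps per epoch. It then compares the stationary weight variance of discrete-time SGD with momentum under epoch-based shuffling against i.i.d. sampling with replacement, for one curvature below and one above the crossover.

import numpy as np

rng = np.random.default_rng(0)
N, B = 512, 32        # dataset size and batch size (illustrative choices)
M = N // B            # minibatch steps per epoch

# Part 1: noise autocorrelation at a fixed weight vector. Per-sample
# gradient noise is drawn once; the weight vector is held fixed, matching
# the assumption that the noise is independent of small weight fluctuations.
g = rng.standard_normal(N)
g -= g.mean()                     # noise relative to the full-batch gradient
E = 4000
noise = np.empty((E, M))
for e in range(E):
    batches = rng.permutation(N).reshape(M, B)   # epoch-based shuffling
    noise[e] = g[batches].mean(axis=1)           # minibatch noise at each step
var = noise.var()
for k in range(3):
    c = np.mean(noise[:, :M - k] * noise[:, k:]) / var
    print(f"noise autocorrelation at lag {k}: {c:+.3f}")  # ~ -1/(M-1) for k >= 1

# Part 2: stationary weight variance of SGD with momentum on a toy
# quadratic: per-sample loss (h/2) * (w - x_i)^2, full-loss curvature h.
x = rng.standard_normal(N)

def stationary_var(h, epoch_shuffling, lr=0.1, mom=0.5, epochs=3000):
    w, v, samples = 0.0, 0.0, []
    for e in range(epochs):
        if epoch_shuffling:
            idx = rng.permutation(N).reshape(M, B)  # without replacement
        else:
            idx = rng.integers(0, N, size=(M, B))   # i.i.d., with replacement
        for b in idx:
            v = mom * v + h * (w - x[b].mean())     # minibatch gradient step
            w -= lr * v
            if e > epochs // 2:                     # discard burn-in
                samples.append(w)
    return np.var(samples)

# Flat direction (h below the crossover): expect a clear variance reduction.
# Steep direction (h above the crossover): expect a ratio close to 1.
for h, label in [(0.02, "flat"), (1.0, "steep")]:
    ratio = stationary_var(h, True) / stationary_var(h, False)
    print(f"{label} direction, h = {h}: Var(w) epoch / i.i.d. = {ratio:.2f}")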