Stochastic gradient descent (SGD) has become a cornerstone of neural network optimization, yet the noise introduced by SGD is often assumed to be uncorrelated over time, despite the ubiquity of epoch-based training. In this work, we challenge this assumption and investigate the effects of epoch-based noise correlations on the stationary distribution of discrete-time SGD with momentum, restricting the analysis to a quadratic loss. Our main contributions are twofold: first, we calculate the exact autocorrelation of the noise for training in epochs under the assumption that the noise is independent of small fluctuations in the weight vector; second, we explore the influence of the correlations introduced by the epoch-based learning scheme on SGD dynamics. We find that for directions with a curvature greater than a hyperparameter-dependent crossover value, the results for uncorrelated noise are recovered. For relatively flat directions, however, the weight variance is significantly reduced. We provide an intuitive explanation for these results based on a crossover between correlation times, contributing to a deeper understanding of the dynamics of SGD in the presence of epoch-based noise correlations.
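The qualitative claim can be illustrated with a small numerical sketch. This is not the authors' code or derivation: it assumes a one-dimensional quadratic loss, a heavy-ball momentum update, and per-example gradient noise that is fixed and independent of the weight (mirroring the stated assumption). All function names, hyperparameter values, and the two curvature values are hypothetical choices made for illustration. Epoch-based training is modeled by reshuffling the examples once per epoch, so that the mini-batch noise sums to zero over each epoch; the baseline draws examples with replacement, which gives uncorrelated noise.

```python
import numpy as np


def epoch_noise(rng, example_noise, n_chains, n_batches, batch_size, epoch_based):
    """Mini-batch gradient noise for one epoch, shape (n_batches, n_chains)."""
    n_examples = example_noise.size
    if epoch_based:
        # One random permutation per chain: every example is used exactly once.
        idx = np.argsort(rng.random((n_chains, n_examples)), axis=1)
    else:
        # Baseline: sample examples with replacement (uncorrelated steps).
        idx = rng.integers(0, n_examples, size=(n_chains, n_examples))
    batches = example_noise[idx].reshape(n_chains, n_batches, batch_size)
    return batches.mean(axis=2).T


def stationary_variance(curvature, epoch_based, lr=0.01, momentum=0.9,
                        n_batches=50, batch_size=8, n_epochs=400,
                        burn_in_epochs=100, n_chains=32, seed=0):
    """Estimate the mean squared weight around the minimum (at w = 0)."""
    rng = np.random.default_rng(seed)
    # Fixed per-example noise with zero mean, so the full-batch noise vanishes.
    example_noise = rng.standard_normal(n_batches * batch_size)
    example_noise -= example_noise.mean()
    example_noise *= np.sqrt(batch_size) / example_noise.std()

    w = np.zeros(n_chains)  # independent chains, run in parallel
    v = np.zeros(n_chains)
    sq_sum, count = 0.0, 0
    for epoch in range(n_epochs):
        noise = epoch_noise(rng, example_noise, n_chains,
                            n_batches, batch_size, epoch_based)
        for xi in noise:
            # Heavy-ball step on L(w) = curvature * w**2 / 2 with noisy gradient.
            v = momentum * v - lr * (curvature * w + xi)
            w = w + v
            if epoch >= burn_in_epochs:
                sq_sum += float(np.sum(w * w))
                count += n_chains
    return sq_sum / count


def variance_ratio(curvature):
    """Weight variance with epoch-based noise relative to uncorrelated noise."""
    return (stationary_variance(curvature, epoch_based=True)
            / stationary_variance(curvature, epoch_based=False))


if __name__ == "__main__":
    for curvature in (0.01, 50.0):  # a flat and a stiff direction (assumed values)
        print(f"curvature={curvature}: variance ratio = {variance_ratio(curvature):.3f}")
```

With these assumed settings, the flat direction relaxes over many epochs and should show a ratio well below one, while the stiff direction relaxes within a fraction of an epoch and should give a ratio close to one. The exact numbers depend on the seed and the hyperparameters, and the sketch makes no attempt to reproduce the crossover value derived in the paper.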