A recent line of empirical studies has shown that stochastic gradient descent (SGD) can exhibit heavy-tailed behavior in practical settings, and that the heaviness of the tails may correlate with overall performance. In this paper, we investigate the emergence of such heavy tails. To the best of our knowledge, previous works on this problem only considered online (also called single-pass) SGD, for which the theoretical emergence of heavy tails is contingent upon access to an infinite amount of data. Hence, the mechanism that generates the reported heavy-tailed behavior in practical settings, where the amount of training data is finite, is still not well understood. Our contribution aims to fill this gap. In particular, we show that the stationary distribution of offline (also called multi-pass) SGD exhibits 'approximate' power-law tails, and that the approximation error is controlled by how fast the empirical distribution of the training data converges to the true underlying data distribution in the Wasserstein metric. Our main takeaway is that, as the number of data points increases, offline SGD behaves increasingly 'power-law-like'. To achieve this result, we first prove nonasymptotic Wasserstein convergence bounds for offline SGD to online SGD as the number of data points increases, which may be of independent interest. Finally, we illustrate our theory with experiments on synthetic data and neural networks.
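As a rough illustration of the distinction between online and offline SGD (a minimal sketch, not the experimental setup of the paper), the following Python snippet runs both variants on a one-dimensional least-squares problem and estimates the tail index of the iterates with the Hill estimator. Everything specific here is an assumption made for illustration: the Gaussian data model, the step size `eta = 0.5`, the dataset size `n = 50`, modelling multi-pass SGD as sampling with replacement from the fixed dataset, and the helper names `run_sgd`, `online_sampler`, `offline_sampler` and `hill_tail_index`.

```python
import numpy as np


def run_sgd(sample_fn, eta, n_steps, x0=0.0):
    """Run SGD on the loss 0.5 * (a * x - y)**2 and return all iterates."""
    x = x0
    iterates = np.empty(n_steps)
    for k in range(n_steps):
        a, y = sample_fn()
        # Gradient of 0.5 * (a * x - y)**2 with respect to x is a * (a * x - y).
        x = x - eta * a * (a * x - y)
        iterates[k] = x
    return iterates


def online_sampler(rng):
    """Online (single-pass) SGD: a fresh data point is drawn at every step."""
    return lambda: (rng.standard_normal(), rng.standard_normal())


def offline_sampler(rng, n):
    """Offline (multi-pass) SGD: draw with replacement from n fixed points."""
    a_data = rng.standard_normal(n)
    y_data = rng.standard_normal(n)

    def sample():
        i = rng.integers(n)
        return a_data[i], y_data[i]

    return sample


def hill_tail_index(samples, k):
    """Hill estimator of the tail index from the k largest absolute values."""
    x = np.sort(np.abs(samples))[::-1]  # descending order statistics
    return 1.0 / np.mean(np.log(x[:k] / x[k]))


if __name__ == "__main__":
    eta, n_steps, burn_in, k = 0.5, 20_000, 1_000, 200  # illustrative choices
    online = run_sgd(online_sampler(np.random.default_rng(0)), eta, n_steps)
    offline = run_sgd(offline_sampler(np.random.default_rng(1), n=50), eta, n_steps)
    print("Hill estimate, online :", hill_tail_index(online[burn_in:], k))
    print("Hill estimate, offline:", hill_tail_index(offline[burn_in:], k))
```

Rerunning the offline variant with larger `n` gives an informal way to see how its estimated tail index compares with the online one. The Hill estimate is sensitive to the choice of `k`, so these numbers are a qualitative diagnostic only.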