It has repeatedly been observed that loss minimization by stochastic gradient descent (SGD) leads to heavy-tailed distributions of neural network parameters. Here, we analyze a continuous diffusion approximation of SGD, called homogenized stochastic gradient descent, show that it is asymptotically heavy-tailed, and give explicit upper and lower bounds on its tail-index. We validate these bounds in numerical experiments and show that they typically approximate the empirical tail-index of SGD iterates closely. In addition, their explicit form enables us to quantify the interplay between optimization parameters and the tail-index. In doing so, we contribute to the ongoing discussion on links between heavy tails and the generalization performance of neural networks, as well as the ability of SGD to avoid suboptimal local minima.
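To make the notion of an empirical tail-index concrete, the following is a minimal sketch of one standard way to estimate it from samples, the Hill estimator. The abstract does not state which estimator is used in the experiments, so this choice is an assumption, as are the function name `hill_tail_index`, the cutoff `k`, and the synthetic Pareto data standing in for SGD iterates.

```python
import math
import random


def hill_tail_index(samples, k):
    """Hill estimate of the tail index from the k largest absolute values.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Sort magnitudes in descending order, dropping exact zeros so the
    # logarithm below is always defined.
    magnitudes = sorted((abs(x) for x in samples if x != 0), reverse=True)
    if not 1 <= k < len(magnitudes):
        raise ValueError("k must satisfy 1 <= k < number of nonzero samples")
    threshold = magnitudes[k]  # the (k+1)-th largest magnitude
    # Average log-excess of the k largest magnitudes over the threshold.
    mean_log_excess = sum(math.log(m / threshold) for m in magnitudes[:k]) / k
    # The Hill estimator targets 1/alpha, so invert to get the tail index.
    return 1.0 / mean_log_excess


# Synthetic stand-in for SGD iterates: Pareto samples with a known tail index.
rng = random.Random(0)
true_alpha = 2.5
pareto_samples = [rng.paretovariate(true_alpha) for _ in range(20000)]
estimate = hill_tail_index(pareto_samples, k=2000)
print(f"true tail index: {true_alpha}, Hill estimate: {estimate:.2f}")
```

In practice the estimate depends on the cutoff `k`: too small a value gives a noisy estimate, while too large a value includes samples from the bulk of the distribution and biases the result, so it is common to inspect the estimate over a range of `k`.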