Stochastic gradients are closely related to both the optimization and the generalization of deep neural networks (DNNs). Some works have attempted to explain the success of stochastic optimization in deep learning by the arguably heavy-tailed nature of gradient noise, while other works have presented theoretical and empirical evidence against the heavy-tail hypothesis. Unfortunately, formal statistical tests for analyzing the structure and heavy tails of stochastic gradients in deep learning remain under-explored. In this paper, we make two main contributions. First, we conduct formal statistical tests on the distribution of stochastic gradients and gradient noise across both parameters and iterations. These tests reveal that dimension-wise gradients usually exhibit power-law heavy tails, while iteration-wise gradients and the stochastic gradient noise caused by minibatch training usually do not. Second, we discover that the covariance spectra of stochastic gradients have power-law structures that previous studies overlooked, and we present the theoretical implications of these structures for the training of DNNs. While previous studies held that the anisotropic structure of stochastic gradients matters to deep learning, they did not expect that the gradient covariance could have such an elegant mathematical structure. Our work challenges the existing belief and provides novel insights into the structure of stochastic gradients in deep learning.
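To make the two kinds of analysis named in the abstract concrete, the following is a minimal sketch on synthetic data only. It is not the authors' procedure: the function names `fit_power_law_tail` and `fit_spectrum_exponent`, the fixed threshold `x_min = 1.0`, the sample sizes, and the synthetic distributions are all assumptions chosen for illustration. The first function fits a continuous power-law tail by maximum likelihood and reports the Kolmogorov-Smirnov (KS) distance of the fit; the second fits a slope to the log-log plot of covariance eigenvalue against rank.

```python
import numpy as np


def fit_power_law_tail(samples, x_min):
    """Fit p(x) ~ x^(-alpha) for x >= x_min by MLE; return (alpha, KS distance)."""
    tail = np.sort(samples[samples >= x_min])
    n = tail.size
    # Maximum-likelihood estimate of the exponent of a continuous power law.
    alpha = 1.0 + n / np.sum(np.log(tail / x_min))
    # CDF of the fitted power law evaluated at the sorted tail samples.
    fitted_cdf = 1.0 - (tail / x_min) ** (1.0 - alpha)
    # Empirical CDF just after and just before each sorted sample.
    ecdf_hi = np.arange(1, n + 1) / n
    ecdf_lo = np.arange(0, n) / n
    ks = max(np.max(np.abs(ecdf_hi - fitted_cdf)),
             np.max(np.abs(ecdf_lo - fitted_cdf)))
    return alpha, ks


def fit_spectrum_exponent(grads, top_k):
    """Estimate s in lambda_k ~ lambda_1 * k^(-s) from the top_k covariance eigenvalues."""
    cov = np.cov(grads, rowvar=False)                  # (d, d) sample covariance
    eigvals = np.linalg.eigvalsh(cov)[::-1][:top_k]    # largest eigenvalues first
    ranks = np.arange(1, top_k + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(eigvals), 1)
    return -slope


rng = np.random.default_rng(0)

# Synthetic stand-ins for gradient magnitudes (assumed, not real gradients):
# a Pareto sample with exponent 2.5 and a half-Gaussian sample.
heavy = (1.0 - rng.random(20000)) ** (-1.0 / 1.5)
light = np.abs(rng.standard_normal(20000))

alpha_heavy, ks_heavy = fit_power_law_tail(heavy, x_min=1.0)
alpha_light, ks_light = fit_power_law_tail(light, x_min=1.0)

# Synthetic "per-sample gradients" whose covariance eigenvalues are exactly k^(-1).
d, true_s = 50, 1.0
scales = np.sqrt(np.arange(1, d + 1) ** (-true_s))
grads = rng.standard_normal((5000, d)) * scales
s_hat = fit_spectrum_exponent(grads, top_k=20)

print(f"Pareto sample:   alpha={alpha_heavy:.2f}, KS distance={ks_heavy:.4f}")
print(f"Gaussian sample: alpha={alpha_light:.2f}, KS distance={ks_light:.4f}")
print(f"Estimated spectrum exponent s={s_hat:.2f}")
```

In this sketch a small KS distance means the power-law fit is plausible, while a large one (as for the Gaussian sample) argues against it. A formal test would additionally convert the distance into a p-value or compare it with a critical value, and would select `x_min` from the data rather than fixing it.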