Stochastic gradients are closely related to both the optimization and the generalization of deep neural networks (DNNs). Some works have attempted to explain the success of stochastic optimization in deep learning by the arguably heavy-tailed properties of gradient noise, while other works have presented theoretical and empirical evidence against the heavy-tail hypothesis on gradient noise. Unfortunately, formal statistical tests for analyzing the structure and heavy tails of stochastic gradients in deep learning remain under-explored. In this paper, we make two main contributions. First, we conduct formal statistical tests on the distribution of stochastic gradients and gradient noise across both parameters and iterations. Our statistical tests reveal that dimension-wise gradients usually exhibit power-law heavy tails, while iteration-wise gradients and the stochastic gradient noise caused by minibatch training usually do not. Second, we discover that the covariance spectra of stochastic gradients have power-law structures overlooked by previous studies, and we present their theoretical implications for the training of DNNs. While previous studies held that the anisotropic structure of stochastic gradients matters to deep learning, they did not expect that the gradient covariance could have such an elegant mathematical structure. Our work challenges the existing belief and provides novel insights into the structure of stochastic gradients in deep learning.
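The two kinds of power-law structure named above can be illustrated with a minimal sketch. It is not the testing procedure of the paper, which the abstract does not specify: it substitutes two simple diagnostics, a log-log least-squares fit of eigenvalue against rank and the Hill estimator of a tail index. The function names `fit_rank_power_law` and `hill_tail_index`, the parametrization lambda_k = c * k^(-s), and the synthetic inputs are all assumptions made for illustration.

```python
import numpy as np


def fit_rank_power_law(eigenvalues, top_k=None):
    """Fit lambda_k ~ c * k**(-s) by least squares in log-log space.

    Illustrative diagnostic only (assumed parametrization), not a formal
    goodness-of-fit test. Returns the pair (s, c).
    """
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # descending
    if top_k is not None:
        lam = lam[:top_k]
    lam = lam[lam > 0]  # the logarithm needs strictly positive values
    ranks = np.arange(1, lam.size + 1)
    slope, intercept = np.polyfit(np.log(ranks), np.log(lam), 1)
    return -slope, float(np.exp(intercept))


def hill_tail_index(samples, k):
    """Hill estimator of the tail index alpha from the k largest |samples|.

    A small alpha indicates a heavier tail. Requires 0 < k < len(samples)
    and a strictly positive (k+1)-th largest magnitude.
    """
    x = np.sort(np.abs(np.asarray(samples, dtype=float)))[::-1]
    log_excess = np.log(x[:k]) - np.log(x[k])  # x[k] acts as the threshold
    return 1.0 / float(log_excess.mean())


if __name__ == "__main__":
    # Synthetic spectrum that follows a power law exactly (assumed values).
    k = np.arange(1, 501)
    s_hat, c_hat = fit_rank_power_law(2.0 * k ** -1.5)
    print(f"spectrum exponent s = {s_hat:.3f}, prefactor c = {c_hat:.3f}")

    # Synthetic heavy-tailed samples: classical Pareto with alpha = 2.
    rng = np.random.default_rng(0)
    samples = rng.pareto(2.0, size=200_000) + 1.0
    print(f"Hill tail index = {hill_tail_index(samples, k=2000):.3f}")
```

On real data, the eigenvalues would come from an estimated gradient covariance matrix and the samples from the entries of a gradient vector (dimension-wise) or from one coordinate tracked over training (iteration-wise). A formal conclusion would still need a proper goodness-of-fit test, since a straight line on a log-log plot alone does not establish a power law.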