Modern training strategies for deep neural networks (NNs) tend to induce heavy-tailed (HT) spectra in the layer weights. Extensive efforts to study this phenomenon have found that NNs with HT weight spectra tend to generalize well. A prevailing explanation for such HT spectra identifies gradient noise during training as a key contributing factor. Our work shows that gradient noise is unnecessary for generating HT weight spectra: two-layer NNs trained with full-batch Gradient Descent/Adam can exhibit HT spectra in their weights after a finite number of training steps. To this end, we first identify the scale of the learning rate at which one step of full-batch Adam can lead to feature learning in the shallow NN, particularly when learning a single-index teacher model. Next, we show that multiple optimizer steps with such (sufficiently) large learning rates can transition the bulk of the weight spectrum into an HT distribution. To understand this behavior, we present a novel perspective based on the singular vectors of the weight matrices and optimizer updates. We show that the HT weight spectrum originates from the 'spike', which is generated by feature learning and interacts with the main bulk to produce an HT spectrum. Finally, we analyze the correlations between HT weight spectra and generalization after multiple optimizer updates with varying learning rates.
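The setting described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' code: it uses plain full-batch gradient descent rather than Adam, trains only the first-layer weights, and assumes a tanh activation, a tanh teacher link, Gaussian inputs, and illustrative sizes. The names `train_and_get_spectrum` and `hill_tail_index` are hypothetical. The Hill estimator is included only as one common way to summarize how heavy the upper tail of a spectrum is.

```python
import numpy as np


def train_and_get_spectrum(lr, steps=5, n=512, d=64, h=128, seed=0):
    """Full-batch GD on the first layer of a two-layer NN; returns singular values of W.

    Assumptions (not taken from the abstract): tanh activation, tanh teacher link,
    Gaussian inputs, fixed second layer, squared loss.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))            # inputs, one sample per row
    beta = rng.standard_normal(d)
    beta /= np.linalg.norm(beta)               # unit-norm teacher direction
    y = np.tanh(X @ beta)                      # single-index teacher: y = g(<beta, x>)

    W = rng.standard_normal((h, d)) / np.sqrt(d)        # first-layer weights (trained)
    a = rng.choice([-1.0, 1.0], size=h) / np.sqrt(h)    # second-layer weights (fixed)

    for _ in range(steps):
        act = np.tanh(X @ W.T)                 # hidden activations, shape (n, h)
        resid = act @ a - y                    # prediction error, shape (n,)
        # Gradient of (1 / 2n) * sum(resid**2) with respect to W, shape (h, d)
        grad = ((resid[:, None] * a[None, :]) * (1.0 - act**2)).T @ X / n
        W = W - lr * grad                      # full batch: no gradient noise

    return np.linalg.svd(W, compute_uv=False)  # singular values, descending


def hill_tail_index(values, k=10):
    """Hill estimate of the tail index from the k largest positive values."""
    v = np.sort(np.asarray(values, dtype=float))[::-1]
    return 1.0 / np.mean(np.log(v[:k] / v[k]))


if __name__ == "__main__":
    for lr in (0.0, 1.0, 20.0):
        s = train_and_get_spectrum(lr)
        alpha = hill_tail_index(s**2)          # eigenvalues of W^T W
        print(f"lr={lr:5.1f}  top singular value={s[0]:.3f}  Hill index={alpha:.2f}")
```

Comparing the top singular value and the tail estimate across learning rates gives a rough view of how a spike and the bulk of the spectrum change with step size. The learning rates in the loop are arbitrary illustrative values, not the scales identified in the paper.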