This paper investigates low-rank structure in the gradients of the training loss for two-layer neural networks while relaxing the usual isotropy assumptions on the training data and parameters. We consider a spiked data model in which the bulk can be anisotropic and ill-conditioned, we do not require independence between the data and the weight matrices, and we analyze both the mean-field and neural-tangent-kernel scalings. We show that the gradient with respect to the input weights is approximately low-rank and is dominated by two rank-one terms: one aligned with the bulk data residual, and another aligned with the rank-one spike in the input data. We characterize how properties of the training data, the scaling regime, and the activation function govern the balance between these two components. We further demonstrate that standard regularizers, such as weight decay, input noise, and Jacobian penalties, selectively modulate these components. Experiments on synthetic and real data corroborate our theoretical predictions.
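To make the claim concrete, the following is a minimal numerical sketch (not the paper's exact setup): it draws data from a spiked model with an anisotropic, ill-conditioned bulk, forms the gradient of the squared loss with respect to the input weights of a two-layer network under an NTK-style 1/sqrt(m) scaling, and inspects its singular-value spectrum. All dimensions, the tanh activation, the target construction, and the initialization below are illustrative assumptions, not the paper's specification.

```python
# Hedged sketch: check whether the input-weight gradient of a two-layer net on
# spiked data is dominated by a few rank-one directions. Illustrative setup only.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 2000, 200, 400                 # samples, input dim, hidden width (assumed)

# Spiked data: anisotropic, ill-conditioned bulk plus a rank-one spike along v.
cov_sqrt = np.diag(np.linspace(0.1, 3.0, d))        # bulk covariance^(1/2), ill-conditioned
v = rng.standard_normal(d); v /= np.linalg.norm(v)  # spike direction (assumed)
Z = rng.standard_normal((n, d)) @ cov_sqrt
spike_strength = 5.0
X = Z + spike_strength * rng.standard_normal((n, 1)) @ v[None, :]

# Targets correlated with the spike direction (an illustrative choice).
y = X @ v + 0.1 * rng.standard_normal(n)

# Two-layer network f(x) = a^T tanh(W x) / sqrt(m) at random initialization.
W = rng.standard_normal((m, d)) / np.sqrt(d)
a = rng.standard_normal(m)

pre = X @ W.T                                        # (n, m) preactivations
resid = np.tanh(pre) @ a / np.sqrt(m) - y            # residuals r_i = f(x_i) - y_i

# Gradient of (1/2n) sum_i r_i^2 w.r.t. W:
#   (1 / (n sqrt(m))) sum_i r_i (a ⊙ tanh'(W x_i)) x_i^T
G = resid[:, None] * (1.0 - np.tanh(pre) ** 2) * a[None, :]   # (n, m)
grad_W = G.T @ X / (n * np.sqrt(m))                  # (m, d)

# Under the paper's assumptions, the top two singular values should carry most
# of the Frobenius energy; the printout lets one inspect this directly.
s = np.linalg.svd(grad_W, compute_uv=False)
print("top singular values:", np.round(s[:5], 4))
print("energy in top 2 directions:", (s[:2] ** 2).sum() / (s ** 2).sum())
```

Varying the spike strength, the bulk conditioning, the activation, or the scaling (mean-field versus NTK) in this sketch is one way to probe how the balance between the two rank-one components shifts, mirroring the characterization described above.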