Feature learning is believed to be one of the fundamental reasons for the success of deep neural networks. It is rigorously known that, under certain conditions, a single gradient descent step on the first layer of a two-layer fully-connected neural network can induce feature learning, characterized by the appearance of a separated rank-one component (a spike) in the spectrum of the feature matrix. However, with a constant gradient descent step size, this spike carries information only from the linear component of the target function, so learning the non-linear components is impossible. We show that with a learning rate that grows with the sample size, such training in fact introduces multiple rank-one components, each corresponding to a specific polynomial feature. We further prove that the limiting training and test errors of the updated neural networks, as both the dimension and the sample size grow large, are fully characterized by these spikes. By precisely analyzing the resulting improvement in the training and test errors, we demonstrate that these non-linear features can enhance learning.
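The mechanism in the abstract can be illustrated numerically. The sketch below is an illustrative assumption, not the paper's exact setup: all sizes, the tanh activation, the single-index target, and the learning-rate exponent are choices made here for demonstration. It takes one gradient descent step on the first layer of a two-layer network, with a learning rate that grows with the sample size n, and then compares the top singular values of the updated weight matrix against the bulk edge at initialization, where separated spikes would appear.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 400, 600, 8000              # input dim, width, samples (illustrative)
X = rng.standard_normal((n, d))       # isotropic Gaussian inputs
w_star = rng.standard_normal(d)
u = X @ w_star / np.sqrt(d)           # single-index projection
y = u + 0.5 * (u**2 - 1)              # target with linear + quadratic components

W0 = rng.standard_normal((m, d)) / np.sqrt(d)   # first-layer initialization
a = rng.choice([-1.0, 1.0], size=m)             # fixed second-layer weights

# One step of gradient descent on the first layer under squared loss,
# for the network f(x) = a^T tanh(W x) / sqrt(m).
Z = X @ W0.T                          # pre-activations, shape (n, m)
act = np.tanh(Z)
resid = act @ a / np.sqrt(m) - y      # residuals, shape (n,)
grad = ((resid[:, None] * (1.0 - act**2)) * a).T @ X / (n * np.sqrt(m))

eta = n ** 0.4                        # learning rate growing with n (assumed scaling)
W1 = W0 - eta * grad

s0 = np.linalg.svd(W0, compute_uv=False)
s1 = np.linalg.svd(W1, compute_uv=False)
print("bulk edge at initialization:", s0[0])
print("top singular values after one step:", s1[:4])
```

With a constant eta the update stays at the scale of the bulk, whereas the growing learning rate pushes one or more singular values well past the initialization bulk edge; under the assumptions above, these are the rank-one components the abstract associates with polynomial features.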