Feature learning is thought to be one of the fundamental reasons for the success of deep neural networks. It is rigorously known that in two-layer fully-connected neural networks, under certain conditions, one step of gradient descent on the first layer followed by ridge regression on the second layer can lead to feature learning, characterized by the appearance of a separated rank-one component (a spike) in the spectrum of the feature matrix. However, with a constant gradient descent step size, this spike carries information only from the linear component of the target function, so learning non-linear components is impossible. We show that with a learning rate that grows with the sample size, such training in fact introduces multiple rank-one components, each corresponding to a specific polynomial feature. We further prove that the limiting large-dimensional, large-sample training and test errors of the updated neural networks are fully characterized by these spikes. By precisely analyzing the improvement in the loss, we demonstrate that these non-linear features can enhance learning.
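The training procedure described above (one gradient step on the first layer, then ridge regression on the second) can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions, not the exact setup analyzed in the paper: the `tanh` activation, the `1/sqrt(N)` output scaling, the squared loss, the reuse of the same samples for both stages, and all names (`one_step_then_ridge`, `top_singular_values`, `eta`, `lam`) are hypothetical choices made for the example.

```python
import numpy as np


def one_step_then_ridge(X, y, W0, a0, eta, lam,
                        act=np.tanh,
                        act_grad=lambda z: 1.0 - np.tanh(z) ** 2):
    """One gradient step on the first layer, then ridge regression on the second.

    Assumed model (hypothetical): f(x) = a^T act(W x) / sqrt(N),
    trained with the loss (1 / 2n) * sum_j (y_j - f(x_j))^2.
    """
    n, _ = X.shape
    N = W0.shape[0]

    Z = X @ W0.T                                  # (n, N) pre-activations
    resid = y - act(Z) @ a0 / np.sqrt(N)          # (n,) residuals at initialization

    # Gradient of the squared loss with respect to the first-layer weights W0.
    M = resid[:, None] * act_grad(Z) * a0[None, :]   # (n, N)
    grad = -(M.T @ X) / (n * np.sqrt(N))             # (N, d)

    W1 = W0 - eta * grad                          # single gradient descent step
    F = act(X @ W1.T)                             # (n, N) updated feature matrix

    # Ridge regression on the second layer:
    # minimize (1/n) * ||y - F a||^2 + lam * ||a||^2.
    a_hat = np.linalg.solve(F.T @ F + lam * n * np.eye(N), F.T @ y)
    return W1, F, a_hat


def top_singular_values(F, k):
    """Return the k largest singular values of the feature matrix F."""
    return np.linalg.svd(F, compute_uv=False)[:k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, N = 400, 50, 60                         # arbitrary small sizes for a demo
    X = rng.standard_normal((n, d))
    beta = rng.standard_normal(d) / np.sqrt(d)
    s = X @ beta
    y = s + 0.5 * (s ** 2 - 1.0)                  # hypothetical target: linear + quadratic part
    W0 = rng.standard_normal((N, d)) / np.sqrt(d)
    a0 = rng.standard_normal(N)

    # Compare a small step size with a larger one; the values are arbitrary.
    for eta in (1.0, 100.0):
        _, F, _ = one_step_then_ridge(X, y, W0, a0, eta=eta, lam=1e-2)
        print(eta, top_singular_values(F, 5))
```

Inspecting the leading singular values of `F` for different values of `eta` is one way to look for separated components in the spectrum. Whether and how many spikes appear depends on the scaling regime studied in the paper, which this small demo does not attempt to reproduce.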