Feature learning is thought to be one of the fundamental reasons for the success of deep neural networks. For two-layer fully-connected neural networks, it is rigorously known that, under certain conditions, one step of gradient descent on the first layer followed by ridge regression on the second layer can lead to feature learning, characterized by the appearance of a separated rank-one component (a spike) in the spectrum of the feature matrix. However, with a constant gradient descent step size, this spike only carries information from the linear component of the target function, so learning non-linear components is impossible. We show that with a learning rate that grows with the sample size, such training in fact introduces multiple rank-one components, each corresponding to a specific polynomial feature. We further prove that the limiting training and test errors of the updated neural networks, in the large-dimensional and large-sample regime, are fully characterized by these spikes. By precisely analyzing the improvement in the training and test errors, we demonstrate that these non-linear features can enhance learning.
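To make the training procedure concrete, the following is a minimal NumPy sketch of one gradient step on the first layer followed by ridge regression on the second layer. It is an illustration under stated assumptions, not the setup analyzed in the paper: the teacher function, the tanh activation, the initialization scales, the exponent `alpha` in the step size `eta = n**alpha`, the ridge penalty `lam`, and the function name `one_step_then_ridge` are all hypothetical choices made for this example. For simplicity, the same samples are reused for the gradient step and for the ridge regression.

```python
import numpy as np


def one_step_then_ridge(n=2000, d=100, N=200, alpha=0.25, lam=1e-2, seed=0):
    """One gradient step on the first layer, then ridge regression on the second.

    All modelling choices here (teacher, activation, scalings) are illustrative
    assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical single-index teacher: a linear part plus a quadratic
    # (second Hermite polynomial) part along a unit direction beta.
    beta = rng.standard_normal(d)
    beta /= np.linalg.norm(beta)
    X = rng.standard_normal((n, d))
    z = X @ beta
    y = z + 0.5 * (z**2 - 1.0)

    # Random initialization (assumed scaling): rows of W0 and the vector a
    # both have norm close to 1.
    W0 = rng.standard_normal((N, d)) / np.sqrt(d)
    a = rng.standard_normal(N) / np.sqrt(N)

    # Negative gradient of (1/2n) * sum_j (y_j - a^T tanh(W x_j))^2 w.r.t. W.
    pre = X @ W0.T                      # (n, N) pre-activations
    resid = y - np.tanh(pre) @ a        # (n,) residuals at initialization
    dact = 1.0 - np.tanh(pre) ** 2      # derivative of tanh at pre
    G = ((resid[:, None] * a[None, :]) * dact).T @ X / n   # (N, d)

    # Step size that grows with the sample size; alpha is an arbitrary choice.
    eta = n**alpha
    W1 = W0 + eta * G

    # Ridge regression on the second layer, using the updated features.
    F0 = np.tanh(pre)                   # features before the step
    F1 = np.tanh(X @ W1.T)              # features after the step
    A = F1.T @ F1 + lam * n * np.eye(N)
    a_hat = np.linalg.solve(A, F1.T @ y)

    return {
        "teacher": beta,
        "gradient": G,
        "features_before": F0,
        "features_after": F1,
        "targets": y,
        "second_layer": a_hat,
        "eta": eta,
    }


if __name__ == "__main__":
    out = one_step_then_ridge()
    # Leading singular values: a large gap after the first one suggests
    # that the gradient is close to rank one.
    print("gradient:", np.linalg.svd(out["gradient"], compute_uv=False)[:3])
    print("features before:", np.linalg.svd(out["features_before"], compute_uv=False)[:3])
    print("features after: ", np.linalg.svd(out["features_after"], compute_uv=False)[:3])
```

Comparing the leading singular values of the feature matrix before and after the step is one way to look for separated components in its spectrum. Whether and how many spikes become visible depends on the chosen sizes and on `alpha`, so this sketch should be read as an exploratory tool rather than a reproduction of the stated results.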