In this manuscript we investigate how two-layer neural networks learn features from data, and thereby improve over the kernel regime, after being trained with a single gradient descent step. Leveraging the connection established by Ba et al. (2022) with a non-linear spiked matrix model, together with recent progress on Gaussian universality (Dandi et al., 2023), we provide an exact asymptotic description of the generalization error in the high-dimensional limit where the number of samples $n$, the width $p$ and the input dimension $d$ grow at a proportional rate. We characterize exactly how adapting to the data is crucial for the network to efficiently learn non-linear functions in the direction of the gradient, whereas at initialization it can only express linear functions in this regime. To our knowledge, our results provide the first tight description of the impact of feature learning on the generalization of two-layer neural networks in the large learning rate regime $\eta=\Theta_{d}(d)$, beyond perturbative finite-width corrections of the conjugate and neural tangent kernels.
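To make the setting concrete, the following is a minimal NumPy sketch of a single full-batch gradient step on the first layer of a two-layer network, with a learning rate proportional to $d$. It is an illustration under stated assumptions, not the manuscript's exact setup: the tanh activation, the $1/\sqrt{p}$ output scaling, the Gaussian initialization, the squared loss, the single-index target, and the constant in `eta = 1.0 * d` are all hypothetical choices, and which learning rate counts as "large" depends on the normalization used.

```python
import numpy as np

def two_layer(W, a, X):
    """Network output a^T tanh(W x) / sqrt(p) for every row x of X."""
    p = W.shape[0]
    return np.tanh(X @ W.T) @ a / np.sqrt(p)

def loss(W, a, X, y):
    """Empirical squared loss (1 / 2n) * sum_i (f(x_i) - y_i)^2."""
    return 0.5 * np.mean((two_layer(W, a, X) - y) ** 2)

def first_layer_gradient(W, a, X, y):
    """Gradient of `loss` with respect to the first-layer weights W."""
    n, p = X.shape[0], W.shape[0]
    pre = X @ W.T                               # (n, p) pre-activations
    resid = np.tanh(pre) @ a / np.sqrt(p) - y   # (n,) residuals
    dact = 1.0 - np.tanh(pre) ** 2              # (n, p) tanh derivative
    # dL/dW[j, k] = (1 / (n sqrt(p))) * sum_i resid_i * a_j * dact[i, j] * X[i, k]
    return (resid[:, None] * dact * a[None, :]).T @ X / (n * np.sqrt(p))

def one_step(W0, a, X, y, eta):
    """First-layer weights after one gradient descent step; a stays fixed."""
    return W0 - eta * first_layer_gradient(W0, a, X, y)

# Hypothetical sizes with n, p and d of the same order.
rng = np.random.default_rng(0)
n, p, d = 400, 300, 200
X = rng.standard_normal((n, d))                 # Gaussian inputs (assumption)
beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)
y = np.tanh(X @ beta)                           # single-index target (assumption)
W0 = rng.standard_normal((p, d)) / np.sqrt(d)   # random initialization (assumption)
a = rng.standard_normal(p)                      # fixed second layer (assumption)

eta = 1.0 * d                                   # learning rate proportional to d
W1 = one_step(W0, a, X, y, eta)

# Singular values of the update show how concentrated it is on one direction.
s = np.linalg.svd(W1 - W0, compute_uv=False)
print("top two singular values of the update:", s[0], s[1])
```

Comparing the leading singular values of `W1 - W0` is one way to inspect numerically how close the update is to a low-rank perturbation of the initial weights, which is the structure behind the spiked matrix model mentioned above.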