Larger and deeper networks generalise well despite their increased capacity to overfit. Understanding why this happens is theoretically and practically important. One approach has been to study the infinitely wide limits of such networks. However, these limits cannot fully explain finite networks: infinite-width networks do not learn features, whereas in finite networks the empirical kernel changes significantly during training. In this work, we derive an iterative linearised training method to investigate this distinction, allowing us to control for sparse (i.e. infrequent) feature updates and to quantify the frequency of feature learning needed to achieve comparable performance. We justify iterative linearisation as an interpolation between a finite analogue of the infinite-width regime, which does not learn features, and standard gradient descent training, which does. We also show that it is analogous to a damped version of the Gauss-Newton algorithm, a second-order method. We show that in a variety of cases iterative linearised training performs on par with standard training, and in particular that comparable performance can be achieved with much less frequent feature learning. We also show that feature learning is essential for good performance. Since such feature learning inevitably changes the neural tangent kernel (NTK), this provides direct evidence against the NTK theory, which states that the NTK remains constant during training.
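The interpolation between linearised training and standard gradient descent can be illustrated with a minimal sketch. This is not the authors' implementation: the two-parameter toy model `a * tanh(b * x)`, the squared-error loss, and all function names (`f`, `jac`, `f_lin`, `train`, `gd`) are assumptions chosen for illustration. The weights `theta` are trained on a model linearised around a reference point `phi`, and `phi` is refreshed every `K` steps (here assumed to happen on steps divisible by `K`). With `K = 1` the update coincides with ordinary gradient descent; with `K = None` the features stay fixed at initialisation.

```python
import math

def f(theta, x):
    # Toy nonlinear model (an assumption for illustration): a * tanh(b * x).
    a, b = theta
    return a * math.tanh(b * x)

def jac(theta, x):
    # Analytic Jacobian of f with respect to (a, b).
    a, b = theta
    t = math.tanh(b * x)
    return (t, a * x * (1.0 - t * t))

def f_lin(theta, phi, x):
    # First-order Taylor expansion of f around the linearisation point phi.
    ja, jb = jac(phi, x)
    return f(phi, x) + ja * (theta[0] - phi[0]) + jb * (theta[1] - phi[1])

def train(data, theta0, lr, steps, K):
    # Iterative linearisation: gradient descent on the linearised model,
    # re-linearising (updating the features) every K steps.
    # K=None never re-linearises, so the features stay at theta0.
    theta = list(theta0)
    phi = list(theta0)
    n = len(data)
    for t in range(steps):
        if K is not None and t % K == 0:
            phi = list(theta)  # feature update
        ga = gb = 0.0
        for x, y in data:
            r = f_lin(theta, phi, x) - y  # residual of the linearised model
            ja, jb = jac(phi, x)          # features frozen at phi
            ga += r * ja
            gb += r * jb
        theta[0] -= lr * ga / n
        theta[1] -= lr * gb / n
    return theta

def gd(data, theta0, lr, steps):
    # Standard gradient descent on the nonlinear model, for comparison.
    theta = list(theta0)
    n = len(data)
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in data:
            r = f(theta, x) - y
            ja, jb = jac(theta, x)
            ga += r * ja
            gb += r * jb
        theta[0] -= lr * ga / n
        theta[1] -= lr * gb / n
    return theta

def mse(theta, data):
    # Mean squared error of the nonlinear model on the data.
    return sum((f(theta, x) - y) ** 2 for x, y in data) / len(data)
```

Calling `train(data, theta0, lr, steps, K)` with increasing `K` and comparing `mse` against the result of `gd` gives a rough way to probe how infrequent the feature updates can be on a given problem before performance degrades.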