Larger and deeper networks generalise well despite their increased capacity to overfit. Understanding why is important both theoretically and practically. One recent approach studies the infinite-width limits of such networks and their corresponding kernels. However, these theoretical tools cannot fully explain finite networks: unlike in the infinite-width case, the empirical kernel of a finite network changes significantly during gradient-descent-based training. In this work, we derive an iterative linearised training method as a novel empirical tool for investigating this distinction. It allows us to control for sparse (i.e. infrequent) feature updates and to quantify how frequently feature learning is needed to achieve comparable performance. We justify iterative linearisation as an interpolation between a finite analogue of the infinite-width regime, which does not learn features, and standard gradient descent training, which does. Informally, we also show that it is analogous to a damped version of the Gauss-Newton algorithm, a second-order method. We show that in a variety of cases iterative linearised training performs, surprisingly, on par with standard training, and we note in particular how much less frequent feature learning needs to be to achieve comparable performance. We also show that feature learning is essential for good performance. Since such feature learning inevitably changes the neural tangent kernel (NTK), our results provide direct evidence against NTK theory, which states that the NTK remains constant during training.
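The interpolation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the sketch assumes that iterative linearisation means training a first-order Taylor expansion of the model around an anchor point and refreshing that anchor (and hence the features) every `period` steps. The toy model `a * tanh(w * x)`, the squared-error loss, and all function and parameter names are hypothetical choices made for illustration.

```python
import math

def model(theta, x):
    """Toy two-parameter network (an assumption): f(x) = a * tanh(w * x)."""
    a, w = theta
    return a * math.tanh(w * x)

def features(theta, x):
    """Gradient of the model output with respect to (a, w) at input x."""
    a, w = theta
    t = math.tanh(w * x)
    return (t, a * x * (1.0 - t * t))

def iterative_linearised_training(theta0, xs, ys, lr, steps, period):
    """Gradient descent on a linearised model, re-linearised every `period` steps."""
    theta = tuple(theta0)
    for step in range(steps):
        if step % period == 0:
            # Feature update: freeze the anchor, its outputs and its features.
            anchor = theta
            base = [model(anchor, x) for x in xs]
            feats = [features(anchor, x) for x in xs]
        grad = [0.0, 0.0]
        for f0, phi, y in zip(base, feats, ys):
            # Output of the linearised model at the current parameters.
            pred = f0 + sum(p * (t - s) for p, t, s in zip(phi, theta, anchor))
            # Gradient of 0.5 * mean squared error, using the frozen features.
            for i in range(2):
                grad[i] += (pred - y) * phi[i] / len(xs)
        theta = tuple(t - lr * g for t, g in zip(theta, grad))
    return theta

def gradient_descent(theta0, xs, ys, lr, steps):
    """Standard gradient descent on the same loss, for comparison."""
    theta = tuple(theta0)
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x, y in zip(xs, ys):
            phi = features(theta, x)
            err = model(theta, x) - y
            for i in range(2):
                grad[i] += err * phi[i] / len(xs)
        theta = tuple(t - lr * g for t, g in zip(theta, grad))
    return theta
```

Under these assumptions, `period=1` refreshes the features at every step and reproduces standard gradient descent, while any `period` at least as large as `steps` never refreshes them and corresponds to fully linearised training with fixed features. Intermediate values give the sparse feature updates the abstract refers to.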