Canonical Regularisation of Wide Feature-Learning Neural Networks

Wide neural networks in the feature-learning regime drive modern deep learning, yet they remain far less studied than their kernel-regime counterparts. We consider a critical yet under-explored difference between the two regimes: the regulariser and prior implied by gradient flow training. This canonical regularisation property is well studied in kernel-regime networks -- of the infinitely many global minima, gradient flow selects exactly the vanishing-ridge solution -- and underpins the celebrated NN-GP correspondence, precisely allowing noise to be modelled during training. However, we prove that ridge regularisation biases gradient flow in feature-learning-regime networks, even in the limit of vanishing regularisation. Over training, ridge distorts the inductive bias of the network, with particular damage done to pretrained networks, where the implicit prior is informative. We resolve this by axiomatising the canonical regulariser as a regime-agnostic function-space energy and lift, which uniquely identifies ridge in the kernel regime and, crucially, generalises to the feature-learning regime. By studying the Riemannian geometry of feature-learning networks, we derive geodesic ridge from our framework as the feature-learning counterpart of ridge. Correspondingly, we prove that the canonical function-space prior is a Riemannian Gibbs Process, generalising the more familiar Gaussian Process. As a practical contribution, we propose arc ridge as a minimax-robust, scalable surrogate for geodesic ridge, revealing a deep relationship between early stopping and canonical regularisation across learning regimes. Finally, we demonstrate the consequences of our theory empirically on both image-processing and NLP transfer-learning problems.
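The kernel-regime statement above (gradient flow selects the vanishing-ridge solution among all global minima) can be illustrated on a toy problem. The following is a minimal sketch, not the paper's method or experiments: it assumes an over-parameterised linear least-squares model as a stand-in for a linearised (kernel-regime) network, approximates gradient flow by small-step gradient descent from zero initialisation, and uses hypothetical names (`ridge_solution`, `gradient_descent`) and arbitrary problem sizes chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Over-parameterised linear regression (illustrative sizes): 5 samples and
# 20 parameters, so infinitely many weight vectors fit the data exactly.
n_samples, n_params = 5, 20
X = rng.standard_normal((n_samples, n_params))
y = rng.standard_normal(n_samples)


def ridge_solution(X, y, lam):
    """Closed-form minimiser of ||X w - y||^2 + lam * ||w||^2 (dual form)."""
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)


def gradient_descent(X, y, steps=20000):
    """Gradient descent on 0.5 * ||X w - y||^2 from w = 0.

    With a small step size this is a discretisation of gradient flow.
    """
    lr = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / (largest singular value)^2
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y)
    return w


w_gd = gradient_descent(X, y)              # limit of (discretised) gradient flow
w_ridge = ridge_solution(X, y, lam=1e-8)   # ridge with near-vanishing penalty
w_min_norm = np.linalg.pinv(X) @ y         # minimum-norm interpolant

gap_gd_ridge = np.max(np.abs(w_gd - w_ridge))
gap_gd_min_norm = np.max(np.abs(w_gd - w_min_norm))
print("max |w_gd - w_ridge|    :", gap_gd_ridge)
print("max |w_gd - w_min_norm| :", gap_gd_min_norm)
```

In this linear setting both gaps should be negligible, since the gradient descent iterates started at zero never leave the row space of `X`. The abstract's claim is that this agreement between gradient flow and vanishing ridge fails once features are learned; the sketch does not attempt to reproduce that failure.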
