Residual neural networks are state-of-the-art deep learning models. Their continuous-depth analogs, neural ordinary differential equations (ODEs), are also widely used. Despite their success, the link between the discrete and continuous models still lacks a solid mathematical foundation. In this article, we take a step in this direction by establishing an implicit regularization of deep residual networks towards neural ODEs, for nonlinear networks trained with gradient flow. We prove that if the network is initialized as a discretization of a neural ODE, then such a discretization holds throughout training. Our results are valid for a finite training time, and also as the training time tends to infinity, provided that the network satisfies a Polyak-Łojasiewicz condition. Importantly, this condition holds for a family of residual networks where the residuals are two-layer perceptrons with an overparameterization in width that is only linear, and it implies the convergence of gradient flow to a global minimum. Numerical experiments illustrate our results.
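To make concrete what "initialized as a discretization of a neural ODE" can mean, the following is a minimal sketch and not the setup used in the article's experiments. It assumes a residual update of the form h_{k+1} = h_k + (1/L) V_k tanh(W_k h_k), where the 1/L scaling, the tanh activation, and the linear-in-depth weight profile are all illustrative assumptions; the function names `residual_forward` and `smooth_weights` are hypothetical. Under these assumptions the network is an explicit Euler scheme for an ODE on [0, 1], so outputs at different depths should approach a common limit as L grows. The sketch covers only the forward pass at initialization, not training.

```python
import numpy as np

def residual_forward(x, weights):
    """Residual network with two-layer perceptron residuals (assumed form):
    h_{k+1} = h_k + (1/L) * V_k @ tanh(W_k @ h_k)."""
    L = len(weights)
    h = x
    for W, V in weights:
        h = h + (V @ np.tanh(W @ h)) / L
    return h

def smooth_weights(L, d, m, seed=0):
    """Sample depth-continuous weight functions at s = k/L (illustrative choice).
    The same seed gives the same underlying functions for every depth L."""
    rng = np.random.default_rng(seed)
    W0, W1 = rng.standard_normal((2, m, d)) / np.sqrt(d)
    V0, V1 = rng.standard_normal((2, d, m)) / np.sqrt(m)
    return [(W0 + (k / L) * W1, V0 + (k / L) * V1) for k in range(L)]

# Compare networks of increasing depth against a much deeper reference network.
x = np.ones(4)
reference = residual_forward(x, smooth_weights(4096, d=4, m=8))
gaps = {
    L: float(np.linalg.norm(residual_forward(x, smooth_weights(L, d=4, m=8)) - reference))
    for L in (8, 64, 512)
}
print(gaps)  # the gap to the deep reference is expected to shrink as L grows
```

If the weights were instead drawn independently at each layer, with no smoothness in depth, the outputs would generally not settle to a limit in this way, which is why the initialization matters for the discretization property.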