Residual neural networks are state-of-the-art deep learning models. Neural ordinary differential equations (ODEs), their continuous-depth analog, are also widely used. Despite their success, the link between the discrete and continuous models still lacks a solid mathematical foundation. In this article, we take a step in this direction by establishing an implicit regularization of deep residual networks towards neural ODEs, for nonlinear networks trained with gradient flow. We prove that if the network is initialized as a discretization of a neural ODE, then it remains such a discretization throughout training. Our results are valid for a finite training time, and also as the training time tends to infinity provided that the network satisfies a Polyak-Łojasiewicz condition. Importantly, this condition holds for a family of residual networks whose residuals are two-layer perceptrons with an overparameterization in width that is only linear, and it implies the convergence of gradient flow to a global minimum. Numerical experiments illustrate our results.
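To make the phrase "initialized as a discretization of a neural ODE" concrete, the following is a minimal sketch and not the construction used in the article. It assumes a residual update of the form h_{k+1} = h_k + (1/L) V_k tanh(W_k h_k), with the weights of layer k obtained by sampling smooth functions of depth at s = k/L. The 1/L scaling, the tanh activation, the helper names `make_weight_function` and `resnet_forward`, and the particular weight functions are all illustrative assumptions. The sketch covers only the forward pass at initialization; it does not train the network and so does not demonstrate the implicit regularization result itself.

```python
import numpy as np


def make_weight_function(rng, shape, scale):
    """Return a smooth map s -> matrix on [0, 1] (a hypothetical choice)."""
    a = rng.normal(size=shape) * scale
    b = rng.normal(size=shape) * scale
    return lambda s: np.cos(np.pi * s) * a + np.sin(np.pi * s) * b


def resnet_forward(x, W_fn, V_fn, depth):
    """Residual network whose k-th residual is a two-layer perceptron.

    Layer k uses the weights W_fn(k / depth) and V_fn(k / depth), so the
    forward pass is an explicit Euler scheme with step 1 / depth for
    dH/ds = V(s) tanh(W(s) H(s)) on [0, 1].
    """
    h = x.copy()
    for k in range(depth):
        s = k / depth
        h = h + (1.0 / depth) * V_fn(s) @ np.tanh(W_fn(s) @ h)
    return h


rng = np.random.default_rng(0)
d, width = 4, 16  # input dimension and hidden width (arbitrary values)
W_fn = make_weight_function(rng, (width, d), scale=1.0 / np.sqrt(d))
V_fn = make_weight_function(rng, (d, width), scale=1.0 / np.sqrt(width))
x = rng.normal(size=d)

# A very deep network stands in for the continuous-depth limit.
reference = resnet_forward(x, W_fn, V_fn, depth=4096)

# Distance between shallower networks and the reference output.
errors = {
    L: float(np.linalg.norm(resnet_forward(x, W_fn, V_fn, L) - reference))
    for L in (8, 64, 512)
}
for L, err in errors.items():
    print(f"depth={L:4d}  distance to reference={err:.2e}")
```

Under these assumptions the printed distance should shrink as the depth grows, which is the sense in which such a network discretizes an ODE. The article's claim is stronger: this structure is preserved by gradient flow training, which the sketch does not attempt to verify.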