We investigate gradient descent training of wide neural networks and the corresponding implicit bias in function space. For univariate regression, we show that the solution of training a width-$n$ shallow ReLU network is within $n^{-1/2}$ of the function which fits the training data and whose difference from the initial function has the smallest 2-norm of the second derivative weighted by a curvature penalty that depends on the probability distribution that is used to initialize the network parameters. We compute the curvature penalty function explicitly for various common initialization procedures. For instance, asymmetric initialization with a uniform distribution yields a constant curvature penalty, and hence the solution function is the natural cubic spline interpolation of the training data. For stochastic gradient descent we obtain the same implicit bias result. We obtain a similar result for different activation functions. For multivariate regression we show an analogous result, whereby the second derivative is replaced by the Radon transform of a fractional Laplacian. For initialization schemes that yield a constant penalty function, the solutions are polyharmonic splines. Moreover, we show that the training trajectories are captured by trajectories of smoothing splines with decreasing regularization strength.
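To make the constant-penalty case concrete, the following is a minimal sketch of the limiting object named above: the natural cubic spline through the training data, which is the interpolant minimizing $\int (f''(x))^2 \, dx$ among twice-differentiable functions fitting the data. It is not the paper's training procedure or code. The function names `natural_cubic_spline` and `curvature_energy` are hypothetical, and the sketch assumes strictly increasing inputs and evaluation inside the data range.

```python
from bisect import bisect_right


def natural_cubic_spline(xs, ys):
    """Build the natural cubic spline through the points (xs[i], ys[i]).

    Assumes xs is strictly increasing and has at least two entries.
    Returns (spline, M), where spline is a callable and M holds the
    second derivatives of the spline at the knots.
    """
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    # Natural boundary conditions: second derivative is zero at both ends.
    M = [0.0] * (n + 1)
    if n >= 2:
        # Tridiagonal system for the interior second derivatives M[1..n-1].
        sub = [h[i - 1] for i in range(1, n)]
        diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n)]
        sup = [h[i] for i in range(1, n)]
        rhs = [
            6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
            for i in range(1, n)
        ]
        # Thomas algorithm: forward elimination.
        for k in range(1, n - 1):
            w = sub[k] / diag[k - 1]
            diag[k] -= w * sup[k - 1]
            rhs[k] -= w * rhs[k - 1]
        # Thomas algorithm: back substitution.
        M[n - 1] = rhs[n - 2] / diag[n - 2]
        for k in range(n - 3, -1, -1):
            M[k + 1] = (rhs[k] - sup[k] * M[k + 2]) / diag[k]

    def spline(x):
        # Locate the interval [xs[i], xs[i+1]] containing x (clamped to valid pieces).
        i = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        a, b = xs[i + 1] - x, x - xs[i]
        return (
            (M[i] * a ** 3 + M[i + 1] * b ** 3) / (6.0 * h[i])
            + (ys[i] / h[i] - M[i] * h[i] / 6.0) * a
            + (ys[i + 1] / h[i] - M[i + 1] * h[i] / 6.0) * b
        )

    return spline, M


def curvature_energy(xs, M):
    """Integral of the squared second derivative of the spline.

    The second derivative is piecewise linear between knot values M[i],
    so each interval contributes h/3 * (M[i]^2 + M[i]*M[i+1] + M[i+1]^2).
    """
    return sum(
        (xs[i + 1] - xs[i]) / 3.0 * (M[i] ** 2 + M[i] * M[i + 1] + M[i + 1] ** 2)
        for i in range(len(xs) - 1)
    )


if __name__ == "__main__":
    # Toy training data (made up for illustration only).
    xs, ys = [0.0, 1.0, 2.0], [0.0, 1.0, 0.0]
    s, M = natural_cubic_spline(xs, ys)
    print(s(0.5), curvature_energy(xs, M))
```

In this sketch the curvature penalty is constant, so every region of the input space is weighted equally. A non-constant penalty, as arises for other initialization distributions, would change the minimizer and is not covered here.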