We investigate gradient descent training of wide neural networks and the corresponding implicit bias in function space. For univariate regression, we show that the solution of training a width-$n$ shallow ReLU network is within $n^{-1/2}$ of the function that fits the training data and whose difference from the initial function has the smallest 2-norm of the second derivative, weighted by a curvature penalty that depends on the probability distribution used to initialize the network parameters. We compute the curvature penalty function explicitly for various common initialization procedures. For instance, asymmetric initialization with a uniform distribution yields a constant curvature penalty, and hence the solution function is the natural cubic spline interpolation of the training data. For stochastic gradient descent we obtain the same implicit bias result. We also obtain a similar result for other activation functions. For multivariate regression we show an analogous result, whereby the second derivative is replaced by the Radon transform of a fractional Laplacian. For initialization schemes that yield a constant penalty function, the solutions are polyharmonic splines. Moreover, we show that the training trajectories are captured by trajectories of smoothing splines with decreasing regularization strength.
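To make the constant-penalty case concrete: among all twice-differentiable functions that interpolate the data, the natural cubic spline is the one minimizing $\int (g''(x))^2 \, dx$. The following is a minimal, illustrative sketch of how such an interpolant can be computed. It is not taken from the paper and does not involve any network training; the function name `natural_cubic_spline` and the sample data are hypothetical, chosen only for this example.

```python
import bisect


def natural_cubic_spline(xs, ys):
    """Return a callable natural cubic spline through the points (xs[i], ys[i]).

    Illustrative helper (hypothetical name). xs must be strictly increasing.
    """
    n = len(xs)
    if n < 2 or len(ys) != n:
        raise ValueError("need at least two points and matching lengths")
    h = [xs[i + 1] - xs[i] for i in range(n - 1)]
    if any(step <= 0 for step in h):
        raise ValueError("xs must be strictly increasing")

    # m[i] is the second derivative of the spline at xs[i].
    # "Natural" boundary conditions fix m[0] = m[n-1] = 0.
    m = [0.0] * n
    k = n - 2  # number of interior unknowns
    if k > 0:
        # Tridiagonal system for interior knots i = 1..n-2:
        # h[i-1]*m[i-1] + 2*(h[i-1]+h[i])*m[i] + h[i]*m[i+1] = rhs[i]
        diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n - 1)]
        rhs = [
            6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
            for i in range(1, n - 1)
        ]
        # Thomas algorithm: forward elimination.
        for j in range(1, k):
            w = h[j] / diag[j - 1]
            diag[j] -= w * h[j]
            rhs[j] -= w * rhs[j - 1]
        # Back substitution.
        interior = [0.0] * k
        interior[k - 1] = rhs[k - 1] / diag[k - 1]
        for j in range(k - 2, -1, -1):
            interior[j] = (rhs[j] - h[j + 1] * interior[j + 1]) / diag[j]
        m[1 : n - 1] = interior

    def evaluate(x):
        if x < xs[0] or x > xs[-1]:
            raise ValueError("x is outside the interpolation range")
        # Locate the interval [xs[i], xs[i+1]] containing x.
        i = min(max(bisect.bisect_right(xs, x) - 1, 0), n - 2)
        a = xs[i + 1] - x
        b = x - xs[i]
        return (
            m[i] * a**3 / (6.0 * h[i])
            + m[i + 1] * b**3 / (6.0 * h[i])
            + (ys[i] / h[i] - m[i] * h[i] / 6.0) * a
            + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * b
        )

    return evaluate


# Usage on made-up data: three points forming a "tent".
spline = natural_cubic_spline([0.0, 1.0, 2.0], [0.0, 1.0, 0.0])
print(spline(0.5))  # prints 0.6875
```

The sketch solves for the second derivatives at the knots with the Thomas algorithm and evaluates the piecewise cubic only inside the data range. For data lying on a straight line, the spline reduces to that line, since its second derivative is zero.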