Controlling the parameters' norm often yields good generalisation when training neural networks. Beyond simple intuitions, however, the relation between regularising the parameters' norm and the resulting estimators remains poorly understood theoretically. For networks with one hidden ReLU layer and unidimensional data, this work shows that the parameters' norm required to represent a function is given by the total variation of its second derivative, weighted by a $\sqrt{1+x^2}$ factor. Notably, this weighting factor disappears when the norm of the bias terms is not regularised. The additional weighting factor is significant because it is shown to enforce the uniqueness and sparsity (in the number of kinks) of the minimal norm interpolator. Conversely, omitting the biases' norm allows for non-sparse solutions. Penalising the bias terms in the regularisation, either explicitly or implicitly, thus leads to sparse estimators.
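To make the weighted total variation concrete, the following is a minimal sketch for piecewise-linear functions, whose second derivative is a sum of point masses located at the kinks. The function names `weighted_total_variation` and `neuron_cost` are hypothetical, and the cost normalisation (half the squared Euclidean norm of the parameters, minimised over positive rescalings of a neuron) is an assumption made for illustration; the abstract does not specify these details. Under that assumption, a single neuron $a\,\mathrm{ReLU}(wx+b)$ with a kink at $x=-b/w$ and slope change $a|w|$ has cost $|a|\sqrt{w^2+b^2} = |a||w|\sqrt{1+x^2}$, which is where the $\sqrt{1+x^2}$ weight appears.

```python
import math


def weighted_total_variation(kinks, slope_changes, penalise_bias=True):
    """Total variation of f'' for a piecewise-linear f.

    Each kink at position x with slope change c contributes |c|,
    weighted by sqrt(1 + x^2) when the biases are penalised
    and by 1 otherwise.
    """
    total = 0.0
    for x, c in zip(kinks, slope_changes):
        weight = math.sqrt(1.0 + x * x) if penalise_bias else 1.0
        total += abs(c) * weight
    return total


def neuron_cost(a, w, b):
    """Smallest half squared norm of one neuron a * relu(w * x + b).

    Assumed normalisation: (a^2 + w^2 + b^2) / 2, minimised over the
    rescaling (a, w, b) -> (a / s, s * w, s * b) with s > 0, which leaves
    the represented function unchanged. By the AM-GM inequality the
    minimum is |a| * sqrt(w^2 + b^2).
    """
    return abs(a) * math.sqrt(w * w + b * b)


# Illustrative values only (not taken from the paper).
a, w, b = 2.0, 3.0, -4.0
kink = -b / w        # location of the neuron's kink
jump = a * abs(w)    # change of slope across the kink

print(weighted_total_variation([kink], [jump]))                       # bias penalised
print(neuron_cost(a, w, b))                                           # agrees up to rounding
print(weighted_total_variation([kink], [jump], penalise_bias=False))  # unweighted
```

With the bias penalised, the two quantities coincide for this neuron (both are about 10 here), whereas dropping the bias penalty leaves only the unweighted slope change, so the location of the kink no longer affects the cost.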