Controlling the parameters' norm often yields good generalisation when training neural networks. Beyond simple intuitions, however, the relation between regularising the parameters' norm and the resulting estimators remains poorly understood theoretically. For one-hidden-layer ReLU networks with unidimensional data, this work shows that the parameters' norm required to represent a function is given by the total variation of its second derivative, weighted by a $\sqrt{1+x^2}$ factor. Notably, this weighting factor disappears when the norm of the bias terms is not regularised. The additional weighting factor matters because it is shown to enforce the uniqueness and sparsity (in the number of kinks) of the minimal norm interpolator. Conversely, omitting the biases' norm allows for non-sparse solutions. Penalising the bias terms in the regularisation, either explicitly or implicitly, thus leads to sparse estimators.
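
To make the weighted quantity concrete, here is a minimal Python sketch for the special case of a piecewise-linear function. It assumes the second derivative is a sum of point masses $a_i \delta_{x_i}$, one per kink, so that the weighted total variation reduces to $\sum_i |a_i| \sqrt{1+x_i^2}$. The function name `weighted_tv`, its arguments, and the example values are hypothetical and chosen only for illustration; the sketch ignores the affine part of the function and any constant factors in the exact statement.

```python
import math


def weighted_tv(kinks, slope_changes, penalise_bias=True):
    """Total variation of f'' for a piecewise-linear f (illustrative sketch).

    kinks: locations x_i where the slope of f changes.
    slope_changes: the jump a_i of f' at each kink.
    penalise_bias: if True, weight each kink by sqrt(1 + x_i^2);
        if False, use the unweighted total variation.
    """
    total = 0.0
    for x, a in zip(kinks, slope_changes):
        weight = math.sqrt(1.0 + x * x) if penalise_bias else 1.0
        total += abs(a) * weight
    return total


# Hypothetical example: two kinks, at x = 0 and x = 1.
kinks = [0.0, 1.0]
slope_changes = [1.0, -2.0]

with_bias = weighted_tv(kinks, slope_changes, penalise_bias=True)
without_bias = weighted_tv(kinks, slope_changes, penalise_bias=False)
print(with_bias, without_bias)
```

With the weighting, a kink costs more the further it lies from the origin, whereas without it only the size of the slope change counts.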