Controlling the norm of the parameters often yields good generalisation when training neural networks. Beyond simple intuitions, however, the relation between the parameters' norm and the resulting estimators remains poorly understood theoretically. For one-hidden-layer ReLU networks with unidimensional data, this work shows that the minimal parameters' norm required to represent a function is given by the total variation of its second derivative, weighted by a $\sqrt{1+x^2}$ factor. By comparison, this $\sqrt{1+x^2}$ weighting disappears when the norm of the bias terms is ignored. The additional weighting is crucial: this work shows that it enforces uniqueness and sparsity (in number of kinks) of the minimal norm interpolator, whereas omitting the norm of the biases allows for non-sparse solutions. Penalising the bias terms in the regularisation, either explicitly or implicitly, thus leads to sparse estimators. This sparsity may contribute to the good generalisation of neural networks that is observed empirically.
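To make the $\sqrt{1+x^2}$ weighting concrete, the following is a minimal sketch for a piecewise-linear function, whose second derivative is a sum of point masses located at its kinks. It is an illustration under stated assumptions, not the construction used in the paper: the parameters' norm is taken to be half the sum of squared weights and biases, each kink is carried by a single neuron $a\,\mathrm{relu}(wx+b)$ with $w>0$, and any affine part of the function is ignored. The names `weighted_tv`, `neuron_cost` and `balanced_neuron` are hypothetical helpers introduced here.

```python
import math


def weighted_tv(kinks, slope_changes, penalise_bias=True):
    """Total variation of f'' for piecewise-linear f with f'' = sum_j c_j * delta(x - t_j).

    With penalise_bias=True each point mass is weighted by sqrt(1 + t_j^2);
    with penalise_bias=False the weight is 1 (bias norm ignored).
    """
    total = 0.0
    for t, c in zip(kinks, slope_changes):
        weight = math.sqrt(1.0 + t * t) if penalise_bias else 1.0
        total += abs(c) * weight
    return total


def neuron_cost(a, w, b, penalise_bias=True):
    """Assumed norm convention: half the sum of squared parameters of one neuron."""
    bias_term = b * b if penalise_bias else 0.0
    return 0.5 * (a * a + w * w + bias_term)


def balanced_neuron(t, c):
    """Parameters (a, w, b) of a * relu(w * x + b) with kink at t and slope change c != 0.

    The rescaling a -> a / s, (w, b) -> (s * w, s * b) leaves the function unchanged,
    so the scale is chosen to balance a^2 against w^2 + b^2 = w^2 * (1 + t^2).
    """
    w = math.sqrt(abs(c)) / (1.0 + t * t) ** 0.25
    a = c / w          # slope change of the neuron is a * w = c
    b = -w * t         # kink sits where w * x + b = 0, i.e. at x = t
    return a, w, b


# Hypothetical example: three kinks with their slope changes.
kinks = [0.0, 1.0, -2.0]
slope_changes = [1.0, -2.0, 0.5]

with_bias = weighted_tv(kinks, slope_changes, penalise_bias=True)
without_bias = weighted_tv(kinks, slope_changes, penalise_bias=False)
network_cost = sum(
    neuron_cost(*balanced_neuron(t, c)) for t, c in zip(kinks, slope_changes)
)

print(f"weighted TV (bias penalised): {with_bias:.6f}")
print(f"plain TV (bias ignored):      {without_bias:.6f}")
print(f"cost of balanced network:     {network_cost:.6f}")
```

Under these assumptions, the cost of the balanced one-neuron-per-kink network coincides with the weighted total variation, and kinks far from the origin cost more than the same slope change placed near it. This dependence on kink location is what vanishes when the bias norm is dropped.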