Large neural networks have proved remarkably effective in modern deep learning practice, even in the overparametrized regime where the number of active parameters is large relative to the sample size. This contradicts the classical perspective that a machine learning model must trade off bias and variance for optimal generalization. To resolve this conflict, we present a nonasymptotic generalization theory for two-layer neural networks with the ReLU activation function by incorporating scaled variation regularization. Interestingly, the regularizer is equivalent to ridge regression from the viewpoint of gradient-based optimization, but plays a role similar to that of the group lasso in controlling the model complexity. By exploiting this "ridge-lasso duality," we obtain new prediction bounds for all network widths, which reproduce the double descent phenomenon. Moreover, the overparametrized minimum risk is lower than its underparametrized counterpart when the signal is strong, and is nearly minimax optimal over a suitable class of functions. By contrast, we show that overparametrized random feature models suffer from the curse of dimensionality and thus are suboptimal.
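The "ridge-lasso duality" can be illustrated numerically. The abstract does not state the exact form of the scaled variation regularizer, so the sketch below uses a standard stand-in as an assumption: for a two-layer ReLU network f(x) = Σ_k a_k·relu(w_k·x), the ridge (weight-decay) penalty ½Σ_k(a_k² + ‖w_k‖²) is compared with the group-lasso-type penalty Σ_k|a_k|·‖w_k‖. Because ReLU is positively homogeneous, each unit can be rescaled without changing the function, and by the AM-GM inequality the ridge penalty minimized over such rescalings equals the group-lasso-type penalty. All function names here are hypothetical and chosen for illustration only.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def network(X, W, a):
    """Two-layer ReLU network: f(x) = sum_k a_k * relu(w_k . x)."""
    return relu(X @ W.T) @ a

def ridge_penalty(W, a):
    """Weight-decay (ridge) penalty: 0.5 * sum_k (a_k^2 + ||w_k||^2)."""
    return 0.5 * (np.sum(a ** 2) + np.sum(W ** 2))

def group_lasso_penalty(W, a):
    """Group-lasso-type penalty: sum_k |a_k| * ||w_k||_2."""
    return np.sum(np.abs(a) * np.linalg.norm(W, axis=1))

def balance(W, a):
    """Rescale each unit so that |a_k| = ||w_k||.

    Assumes every a_k and w_k is nonzero. Since relu(c * z) = c * relu(z)
    for c > 0, mapping (w_k, a_k) -> (c_k * w_k, a_k / c_k) leaves the
    network function unchanged.
    """
    norms = np.linalg.norm(W, axis=1)
    c = np.sqrt(np.abs(a) / norms)
    return W * c[:, None], a / c

# Illustrative data: 5 inputs in dimension 3, a network of width 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W = rng.normal(size=(4, 3))
a = rng.normal(size=4)

W_bal, a_bal = balance(W, a)

print("same function after rescaling:",
      np.allclose(network(X, W, a), network(X, W_bal, a_bal)))
print("ridge penalty (original):  ", ridge_penalty(W, a))
print("ridge penalty (balanced):  ", ridge_penalty(W_bal, a_bal))
print("group-lasso-type penalty:  ", group_lasso_penalty(W, a))
```

In this sketch the balanced ridge penalty coincides with the group-lasso-type penalty, while the unbalanced ridge penalty is never smaller. This is one way to see how a penalty that looks like ridge regression to a gradient-based optimizer can still control complexity in the way a group lasso does.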