In this paper, we show and analyze how adding a single dropout layer before the fully-connected linear layer can easily remove double descent. The surprising double-descent phenomenon, in which the prediction error rises and then falls again as either the sample size or the model size increases, has drawn wide attention in recent years. We show, both theoretically and empirically, that optimal dropout alleviates this phenomenon in the linear regression model and in nonlinear random feature regression. Specifically, we consider the linear model ${y}=X{\beta}^0+{\epsilon}$ with $X\in\mathbb{R}^{n\times p}$. We obtain the optimal dropout hyperparameter by estimating the ground truth ${\beta}^0$ with the generalized ridge-type estimator $\hat{{\beta}}=(X^TX+\alpha\cdot\mathrm{diag}(X^TX))^{-1}X^T{y}$. Moreover, we empirically show that optimal dropout can achieve a monotonic test error curve in nonlinear neural networks on Fashion-MNIST and CIFAR-10. Our results suggest considering dropout for risk curve scaling when the peak phenomenon is encountered. In addition, we explain why earlier deep learning models did not encounter double descent: they already applied a standard regularization technique such as dropout. To the best of our knowledge, this paper is the first to analyze the relationship between dropout and double descent.
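As a rough illustration of the generalized ridge-type estimator stated above, the following NumPy sketch computes $\hat{\beta}=(X^TX+\alpha\cdot\mathrm{diag}(X^TX))^{-1}X^Ty$ on synthetic data drawn from $y=X\beta^0+\epsilon$. The function name `dropout_ridge_estimator`, the problem sizes, the noise level, and the value of $\alpha$ are illustrative assumptions, not taken from the paper, and the sketch does not show how the optimal $\alpha$ is selected.

```python
import numpy as np


def dropout_ridge_estimator(X, y, alpha):
    """Solve (X^T X + alpha * diag(X^T X)) beta = X^T y for beta.

    `alpha` is the regularization strength from the formula in the text;
    how it is chosen optimally is not covered by this sketch.
    """
    gram = X.T @ X
    # Keep only the diagonal of the Gram matrix for the penalty term.
    penalty = alpha * np.diag(np.diag(gram))
    return np.linalg.solve(gram + penalty, X.T @ y)


# Synthetic data from y = X beta0 + eps (sizes and noise are arbitrary choices).
rng = np.random.default_rng(0)
n, p = 50, 20
X = rng.standard_normal((n, p))
beta0 = rng.standard_normal(p)
y = X @ beta0 + 0.1 * rng.standard_normal(n)

beta_ols = dropout_ridge_estimator(X, y, alpha=0.0)  # no penalty
beta_reg = dropout_ridge_estimator(X, y, alpha=0.5)  # penalized estimate
```

With `alpha=0.0` and $n>p$ the estimator coincides with ordinary least squares; a positive `alpha` penalizes each coordinate in proportion to the squared norm of its feature column.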