In this paper, we show and analyze that double descent can be removed simply by adding a single dropout layer before the fully-connected linear layer. The surprising double-descent phenomenon, in which the prediction error rises and then falls again as either the sample size or the model size increases, has attracted considerable attention in recent years. We show, both theoretically and empirically, that this phenomenon can be alleviated by using optimal dropout in the linear regression model and in nonlinear random feature regression. For the linear model ${y}=X{\beta}^0+{\epsilon}$ with $X\in\mathbb{R}^{n\times p}$, we obtain the optimal dropout hyperparameter by estimating the ground truth ${\beta}^0$ with the generalized ridge-type estimator $\hat{{\beta}}=(X^TX+\alpha\cdot\mathrm{diag}(X^TX))^{-1}X^T{y}$. Moreover, we empirically show that optimal dropout can achieve a monotonic test error curve in nonlinear neural networks on Fashion-MNIST and CIFAR-10. Our results suggest considering dropout for risk curve scaling when the peak phenomenon is encountered. In addition, we explain why previous deep learning models did not encounter double descent: they already applied a standard regularization approach such as dropout. To the best of our knowledge, this paper is the first to analyze the relationship between dropout and double descent.
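As an illustration of the generalized ridge-type estimator quoted above, the following NumPy sketch computes $\hat{\beta}=(X^TX+\alpha\cdot\mathrm{diag}(X^TX))^{-1}X^Ty$ on synthetic data. The function name `dropout_ridge_estimator`, the data dimensions, the noise level, and the value of `alpha` are assumptions made for the example and are not taken from the paper; how $\alpha$ is chosen optimally is not specified in the abstract, so it is treated here as a free hyperparameter.

```python
import numpy as np


def dropout_ridge_estimator(X, y, alpha):
    """Return (X^T X + alpha * diag(X^T X))^{-1} X^T y.

    Hypothetical helper for illustration; alpha >= 0 is a free
    hyperparameter (alpha = 0 gives ordinary least squares when
    X^T X is invertible).
    """
    gram = X.T @ X
    # Penalty matrix keeps only the diagonal of the Gram matrix.
    penalty = alpha * np.diag(np.diag(gram))
    # Solve the linear system instead of forming an explicit inverse.
    return np.linalg.solve(gram + penalty, X.T @ y)


# Synthetic linear model y = X beta0 + eps (all values are assumptions).
rng = np.random.default_rng(0)
n, p = 50, 20
X = rng.standard_normal((n, p))
beta0 = rng.standard_normal(p)
y = X @ beta0 + 0.1 * rng.standard_normal(n)

beta_hat = dropout_ridge_estimator(X, y, alpha=0.1)
print(beta_hat.shape)
```

Unlike standard ridge regression, which penalizes every coefficient equally through $\alpha I$, the penalty here scales with the diagonal of $X^TX$, so each coefficient is shrunk in proportion to the squared norm of its feature column.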