Double descent is a surprising phenomenon in machine learning: as the number of model parameters grows relative to the number of data points, test error peaks near the point where the model can just fit the training data, then drops again as models grow ever larger into the highly overparameterized (data-undersampled) regime. This second drop runs counter to the classical view of overfitting in learning theory and has arguably underpinned the success of large models in machine learning. This non-monotonic behavior of test loss depends on the number of data points, the dimensionality of the data, and the number of model parameters. Here, we briefly describe double descent, then explain why it occurs in an informal and approachable manner, requiring only familiarity with linear algebra and introductory probability. We provide visual intuition using polynomial regression, then mathematically analyze double descent in ordinary linear regression and identify three interpretable factors that, when all present simultaneously, create double descent. We demonstrate that double descent occurs on real data when using ordinary linear regression, then demonstrate that it does not occur when any one of the three factors is ablated. We use this understanding to shed light on recent observations in nonlinear models concerning superposition and double descent. Code is publicly available.
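As a rough illustration of the linear regression setting described above, the following minimal sketch fits minimum-norm least squares on synthetic Gaussian data and reports test error as the number of training points varies around the number of features. It is not the publicly available code mentioned above: the function name `mean_test_mse`, the synthetic data model, the noise level, and all sizes are assumptions chosen for illustration, and varying the number of training points with the feature count fixed is just one way to change the parameter-to-data ratio.

```python
import numpy as np


def mean_test_mse(n_train, n_features=20, noise_std=0.5,
                  n_test=500, n_trials=200, seed=0):
    """Average test MSE of minimum-norm least squares over random trials.

    Hypothetical helper for illustration; data are synthetic (assumed
    isotropic Gaussian features with a noisy linear target).
    """
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_trials):
        # Ground-truth weights, scaled so the signal has roughly unit variance.
        w_true = rng.normal(size=n_features) / np.sqrt(n_features)

        X_train = rng.normal(size=(n_train, n_features))
        y_train = X_train @ w_true + noise_std * rng.normal(size=n_train)

        # The pseudoinverse gives the ordinary least-squares solution when
        # n_train >= n_features, and the minimum-norm interpolating solution
        # when n_train < n_features (the overparameterized regime).
        w_hat = np.linalg.pinv(X_train) @ y_train

        X_test = rng.normal(size=(n_test, n_features))
        y_test = X_test @ w_true + noise_std * rng.normal(size=n_test)
        errors.append(np.mean((X_test @ w_hat - y_test) ** 2))
    return float(np.mean(errors))


if __name__ == "__main__":
    # Sweep the number of training points across the number of features (20).
    for n in (2, 5, 10, 15, 18, 20, 22, 25, 40, 100, 400):
        print(f"n_train={n:4d}  mean test MSE={mean_test_mse(n):.3f}")
```

Under these assumptions, the printed error should be largest when the number of training points is close to the number of features, and lower both with far fewer training points (heavily overparameterized) and with far more (heavily underparameterized).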