A central problem in data science is to use potentially noisy samples of an unknown function to predict values for unseen inputs. In classical statistics, predictive error is understood as a trade-off between bias and variance that balances model simplicity against the ability to fit complex functions. However, over-parameterized models exhibit counterintuitive behaviors, such as "double descent," in which generalization error decreases again as model complexity grows beyond the interpolation threshold; other models exhibit still more complicated patterns of predictive error with multiple peaks and valleys. Neither double descent nor these multiple-descent phenomena are well explained by the bias-variance decomposition. We introduce a novel decomposition, the generalized aliasing decomposition (GAD), to explain the relationship between predictive performance and model complexity. The GAD decomposes the predictive error into three parts: (1) model insufficiency, which dominates when the number of parameters is much smaller than the number of data points; (2) data insufficiency, which dominates when the number of parameters is much greater than the number of data points; and (3) generalized aliasing, which dominates between these two extremes. We demonstrate the applicability of the GAD to diverse applications, including random feature models from machine learning, Fourier transforms from signal processing, solution methods for differential equations, and the prediction of formation enthalpy in materials discovery. Because key components of the GAD can be computed explicitly from the relationship between the model class and the samples, without seeing any data labels, it can answer questions related to experimental design and model selection before data are collected or experiments performed. We illustrate this use on several examples and discuss the implications for predictive modeling and data science.
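To make the label-free claim concrete, the following minimal Python sketch (not the authors' code) illustrates aliasing in the classical linear least-squares setting: with the feature matrix split into modeled columns X1 and unmodeled columns X2, the standard aliasing matrix A = pinv(X1) @ X2 depends only on the sample locations and the model class, never on the observed labels. The polynomial feature construction and the particular split into modeled versus unmodeled columns are illustrative assumptions.

```python
# A minimal sketch, assuming a linear model with a polynomial feature basis.
# The aliasing matrix A = pinv(X1) @ X2 describes how unmodeled modes (X2)
# masquerade as modeled ones (X1); note that it never touches labels y.
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_modeled, n_unmodeled = 30, 10, 40
x = rng.uniform(-1.0, 1.0, size=n_samples)  # hypothetical sample locations

# Monomial features 1, x, x^2, ...; any fixed basis would do.
X_full = np.vander(x, N=n_modeled + n_unmodeled, increasing=True)
X1, X2 = X_full[:, :n_modeled], X_full[:, n_modeled:]

# Label-free: computable from the design alone, i.e., before any data are
# collected at these sample locations.
aliasing = np.linalg.pinv(X1) @ X2
print("aliasing matrix shape:", aliasing.shape)
print("largest aliasing coefficient:", np.abs(aliasing).max())
```

Because the matrix above is fixed once the sample locations and model class are chosen, one can compare candidate experimental designs by their aliasing structure before running any experiment, which is the use case the abstract highlights.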