Cross-Validation (CV) is the default choice for evaluating the performance of machine learning models. Despite its wide usage, its statistical benefits have remained only half-understood, especially in challenging nonparametric regimes. In this paper we fill this gap and show that, in fact, for a wide spectrum of models, CV does not statistically outperform the simple "plug-in" approach where one reuses the training data for testing evaluation. Specifically, in terms of both the asymptotic bias and coverage accuracy of the associated interval for out-of-sample evaluation, $K$-fold CV provably cannot outperform plug-in, regardless of the rate at which the parametric or nonparametric models converge. Leave-one-out CV can have a smaller bias than plug-in; however, this bias improvement is negligible compared to the variability of the evaluation, and in some important cases leave-one-out again does not outperform plug-in once this variability is taken into account. We obtain our theoretical comparisons via a novel higher-order Taylor analysis that allows us to derive necessary conditions for limit theorems of testing evaluations, which applies to model classes that are not amenable to previously known sufficient conditions. Our numerical results demonstrate that plug-in indeed performs no worse than CV across a wide range of examples.
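To make the comparison concrete, the following is a minimal sketch (not the paper's experimental setup) contrasting the two estimators of out-of-sample error discussed above: the plug-in estimate, which reuses the training data, and the $K$-fold CV estimate, which averages held-out losses. The choice of model (`Ridge` from scikit-learn), the data-generating process, and the sample sizes are all illustrative assumptions.

```python
# Illustrative comparison of plug-in vs. K-fold CV evaluation.
# Model and data-generating process are hypothetical choices.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)

# Plug-in: fit on all the data, then reuse the same data to estimate the loss.
model = Ridge(alpha=1.0).fit(X, y)
plug_in = np.mean((y - model.predict(X)) ** 2)

# K-fold CV: refit on each training split, average the held-out losses.
K = 5
cv_losses = []
for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    fold_model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    cv_losses.append(np.mean((y[test_idx] - fold_model.predict(X[test_idx])) ** 2))
cv = np.mean(cv_losses)

print(f"plug-in estimate: {plug_in:.4f}")
print(f"{K}-fold CV estimate: {cv:.4f}")
```

The paper's claim, in these terms, is that the gap between the two estimates is asymptotically dominated by the sampling variability of the evaluation itself, so CV's extra refitting buys no statistical advantage.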