Cross-validation (CV) is one of the most popular tools for assessing and selecting predictive models. However, standard CV is computationally expensive when the number of folds is large. Recently, under the empirical risk minimization (ERM) framework, a line of work has proposed efficient methods that approximate CV using the solution of the ERM problem trained on the full dataset. In large-scale problems, however, the exact solution of the ERM problem can be hard to obtain, either because computational resources are limited or because early stopping is used to prevent overfitting. In this paper, we propose a new paradigm for efficiently approximating CV when the ERM problem is solved via an iterative first-order algorithm that is not run to convergence. Our method extends existing guarantees for CV approximation to hold along the whole trajectory of the algorithm, including at convergence, and thus generalizes existing CV approximation methods. Finally, we illustrate the accuracy and computational efficiency of our method through a range of empirical studies.
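To make the idea of approximating CV from the full-data ERM solution concrete, the following is a minimal sketch of one well-known variant from the existing line of work: a single Newton step from the full-data solution toward each leave-one-out solution. It is not the method proposed in this paper, which targets iterates before convergence. The sketch assumes ridge regression, a small made-up dataset, and hypothetical helper names (`solve`, `hessian`, `ridge_fit`, `approx_loo_errors`, `exact_loo_errors`). Because the ridge objective is quadratic, the Newton step recovers the leave-one-out solution exactly here, which makes the result easy to check against brute-force refitting; for non-quadratic losses the step is only an approximation.

```python
# Sketch only: Newton-step approximation to leave-one-out (LOO) CV for ridge
# regression. All function names and the toy data are illustrative assumptions.
# Objective: F(theta) = 0.5 * sum_j (x_j . theta - y_j)^2 + 0.5 * lam * |theta|^2


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented matrix [A | b]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hessian(X, lam, skip=None):
    """Hessian of F, optionally leaving out the data point with index `skip`."""
    d = len(X[0])
    H = [[lam if a == b else 0.0 for b in range(d)] for a in range(d)]
    for j, x in enumerate(X):
        if j == skip:
            continue
        for a in range(d):
            for b in range(d):
                H[a][b] += x[a] * x[b]
    return H


def ridge_fit(X, y, lam, skip=None):
    """Exact ridge solution, optionally leaving out the point with index `skip`."""
    d = len(X[0])
    rhs = [0.0] * d
    for j, (x, t) in enumerate(zip(X, y)):
        if j == skip:
            continue
        for a in range(d):
            rhs[a] += x[a] * t
    return solve(hessian(X, lam, skip), rhs)


def approx_loo_errors(X, y, lam):
    """LOO squared errors using one Newton step from the full-data solution."""
    theta = ridge_fit(X, y, lam)  # full-data ERM solution
    errors = []
    for i, (x, t) in enumerate(zip(X, y)):
        # Gradient of the i-th loss term at the full-data solution.
        grad_i = [(dot(x, theta) - t) * a for a in x]
        # The full gradient vanishes at theta, so the LOO gradient is -grad_i
        # and the Newton step on the LOO objective is +H_{-i}^{-1} grad_i.
        step = solve(hessian(X, lam, skip=i), grad_i)
        theta_i = [a + s for a, s in zip(theta, step)]
        errors.append((dot(x, theta_i) - t) ** 2)
    return errors


def exact_loo_errors(X, y, lam):
    """LOO squared errors by refitting the model once per held-out point."""
    return [(dot(x, ridge_fit(X, y, lam, skip=i)) - t) ** 2
            for i, (x, t) in enumerate(zip(X, y))]


# Toy data (made up for illustration): intercept column plus one feature.
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.3], [1.0, -0.7], [1.0, 1.5]]
y = [1.1, -0.9, 3.2, 0.8, -0.2, 2.4]
lam = 0.1

approx = approx_loo_errors(X, y, lam)
exact = exact_loo_errors(X, y, lam)
print("approximate LOO CV error:", sum(approx) / len(approx))
print("exact LOO CV error:      ", sum(exact) / len(exact))
```

In this ridge example the Newton step costs about as much as a refit, so it only illustrates the structure of the approximation. The computational savings described above would be expected when each refit requires an iterative solver, as with logistic regression.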