Cross-validation (CV) is one of the most popular tools for assessing and selecting predictive models. However, standard CV becomes computationally expensive when the number of folds is large. Recently, under the empirical risk minimization (ERM) framework, a line of work has proposed efficient methods that approximate CV using the solution of the ERM problem trained on the full dataset. In large-scale problems, however, the exact solution of the ERM problem can be hard to obtain, either because computational resources are limited or because training is stopped early to prevent overfitting. In this paper, we propose a new paradigm for efficiently approximating CV when the ERM problem is solved by an iterative first-order algorithm that is not run to convergence. Our method extends existing guarantees for CV approximation so that they hold along the whole trajectory of the algorithm, including at convergence, and thus generalizes existing CV approximation methods. Finally, we illustrate the accuracy and computational efficiency of our method through a range of empirical studies.
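To make concrete what it means to obtain CV from the full-data ERM solution, the sketch below uses a classical special case rather than the method proposed in this paper: one-dimensional ridge regression without an intercept, where leave-one-out CV can be recovered exactly from a single fit by a leverage correction. The function names, the data, and the penalty value `lam` are illustrative assumptions. The sketch also assumes the ERM problem is solved exactly, which is the assumption the paper relaxes.

```python
# Illustrative special case (not the paper's algorithm): scalar ridge regression,
# model y ~ w * x, objective sum((y - w*x)^2) + lam * w^2 with lam > 0.

def ridge_fit(xs, ys, lam):
    """Closed-form minimizer of the scalar ridge objective."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)

def loo_cv_brute_force(xs, ys, lam):
    """Leave-one-out CV error obtained by refitting n times."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        # Refit on all points except the i-th, then score the held-out point.
        w_minus_i = ridge_fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], lam)
        total += (ys[i] - w_minus_i * xs[i]) ** 2
    return total / n

def loo_cv_from_full_fit(xs, ys, lam):
    """Leave-one-out CV error computed from the single full-data fit."""
    n = len(xs)
    w = ridge_fit(xs, ys, lam)
    s = sum(x * x for x in xs) + lam
    total = 0.0
    for x, y in zip(xs, ys):
        leverage = x * x / s  # strictly below 1 because lam > 0
        # Dividing the full-fit residual by (1 - leverage) gives the held-out residual.
        total += ((y - w * x) / (1.0 - leverage)) ** 2
    return total / n

# Hypothetical toy data; inputs must be Python lists.
xs = [0.5, 1.0, 1.5, 2.0, 2.5]
ys = [1.1, 1.9, 3.2, 3.9, 5.1]
print(loo_cv_brute_force(xs, ys, 0.1), loo_cv_from_full_fit(xs, ys, 0.1))
```

For this quadratic loss the two estimates agree up to floating-point error. For general ERM losses the analogous corrections are only approximate, and they rely on the exact minimizer being available.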