Generalized cross-validation (GCV) is a widely used method for estimating the squared out-of-sample prediction risk that employs a scalar degrees-of-freedom adjustment (in a multiplicative sense) to the squared training error. In this paper, we examine the consistency of GCV for estimating the prediction risk of arbitrary ensembles of penalized least-squares estimators. We show that GCV is inconsistent for any finite ensemble of size greater than one. Towards repairing this shortcoming, we identify a correction that adds a scalar term (in an additive sense) based on degrees-of-freedom-adjusted training errors from each ensemble component. The proposed estimator (termed CGCV) maintains the computational advantages of GCV and requires no sample splitting, model refitting, or out-of-bag risk estimation. The estimator stems from a finer inspection of the ensemble risk decomposition and two intermediate risk estimators for the components in this decomposition. We provide a non-asymptotic analysis of CGCV and the two intermediate risk estimators for ensembles of convex penalized estimators under Gaussian features and a linear response model. Furthermore, in the special case of ridge regression, we extend the analysis to general feature and response distributions using random matrix theory, thereby establishing model-free uniform consistency of CGCV.
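To make the multiplicative degrees-of-freedom adjustment concrete, the following is a minimal sketch of scalar GCV for a single ridge regression (the base case the paper generalizes from), on synthetic data. Here GCV divides the mean squared training error by (1 - df/n)^2, where df = tr(S) is the trace of the smoother matrix S = X (X'X + λI)^{-1} X'; the data-generating setup and the choice λ = 1 are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Synthetic linear-model data (illustrative, not from the paper)
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta + rng.standard_normal(n)

lam = 1.0  # ridge penalty (arbitrary choice for illustration)
# Smoother matrix S mapping y to ridge fitted values: S = X (X'X + lam I)^{-1} X'
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
resid = y - S @ y
train_err = np.mean(resid**2)          # squared training error
df = np.trace(S)                       # effective degrees of freedom
gcv = train_err / (1.0 - df / n) ** 2  # multiplicative df adjustment
print(gcv)
```

The adjustment inflates the optimistic training error by the factor (1 - df/n)^{-2}; the paper's point is that for an ensemble of such estimators, no single multiplicative factor of this form suffices, motivating the additive correction in CGCV.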