Cross-validation (CV) is one of the most widely used techniques in statistical learning for estimating the test error of a model, but its behavior is not yet fully understood. Standard confidence intervals for test error built from CV estimates have been shown to have coverage that may fall below nominal levels. This occurs because each sample is used for both training and testing during CV, so the CV error estimates are correlated. If this correlation is not accounted for, the variance estimate is smaller than it should be. One way to mitigate this issue is to instead estimate the mean squared error of the prediction error using nested CV. This approach has been shown to achieve better coverage than intervals derived from standard CV. In this work, we generalize the nested CV idea to the Cox proportional hazards model and explore various choices of test error for this setting.
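For concreteness, the standard CV interval that the abstract describes as under-covering can be sketched as follows. This is a minimal illustration under stated assumptions, not the nested CV procedure and not the Cox setting: it swaps in a one-feature least-squares model with squared-error loss, and the names `fit_line` and `naive_cv_interval` are hypothetical helpers introduced only for this example.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single feature."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def naive_cv_interval(xs, ys, k=5, z=1.96, seed=0):
    """Return (estimate, lower, upper) for the naive K-fold CV interval."""
    n = len(xs)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all samples
    errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Squared error on each held-out sample (assumed loss for this sketch).
        errors.extend((ys[i] - (a + b * xs[i])) ** 2 for i in fold)
    est = statistics.fmean(errors)
    # Naive standard error: treats the n per-sample errors as independent,
    # ignoring the correlation induced by overlapping training sets.
    se = statistics.stdev(errors) / math.sqrt(n)
    return est, est - z * se, est + z * se


# Simulated data (assumption: linear signal with unit-variance Gaussian noise).
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(100)]
ys = [2.0 * x + rng.gauss(0, 1) for x in xs]
est, lo, hi = naive_cv_interval(xs, ys)
print(f"CV estimate {est:.3f}, naive 95% interval ({lo:.3f}, {hi:.3f})")
```

The standard error line is where the independence assumption enters; because every sample also appears in the training sets of the other folds, the per-sample errors are correlated and this `se` tends to be too small, which is the issue nested CV is meant to address.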