Unbiased Estimation of Structured Prediction Error

Many modern datasets, such as those in ecology and geology, are composed of samples with spatial structure and dependence. Because such data violate the usual independent and identically distributed (IID) assumption in machine learning and classical statistics, it is unclear a priori how one should measure the performance and generalization of models. Several authors have empirically investigated cross-validation (CV) methods in this setting, reaching mixed conclusions. We provide a class of unbiased estimation methods for a noise-elevated version of the error, covering general quadratic errors, correlated Gaussian response, and an arbitrary prediction function $g$. Our approach generalizes the coupled bootstrap (CB) from the normal means problem to general normal data, allowing correlation both within and between the training and test sets. CB relies on creating bootstrap samples that are intelligently decoupled, in the sense of being statistically independent. Specifically, the key to CB lies in generating two independent "views" of our data and using them as stand-ins for the usual independent training and test samples. Beginning with Mallows' $C_p$, we generalize the estimator to develop our generalized $C_p$ estimators (GC). We show that, under only a moment condition on $g$, this noise-elevated error estimate converges smoothly to the noiseless error estimate. We also show that when Stein's unbiased risk estimator (SURE) applies, GC converges to SURE, as in the normal means problem. Further, we use these same tools to analyze CV and provide theoretical analysis to help understand when CV will provide good estimates of error. Simulations align with our theoretical results, demonstrating the effectiveness of GC and illustrating the behavior of CV methods. Lastly, we apply our estimator to a model selection task on geothermal data in Nevada.
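
The "two independent views" idea can be illustrated with a minimal sketch. It assumes the covariance $\Sigma$ of the response is known and uses the familiar coupled-bootstrap form from the normal means setting, with the auxiliary noise drawn as $\omega \sim N(0, \Sigma)$: the views are $Y + \sqrt{\alpha}\,\omega$ and $Y - \omega/\sqrt{\alpha}$, whose cross-covariance is $\Sigma - \Sigma = 0$, so they are independent under joint normality. The function names, the parameter `alpha`, and the toy covariance are illustrative assumptions, not taken from the paper, and the paper's exact construction may differ.

```python
import math
import random


def coupled_views(y, chol, alpha, rng):
    """Split one observation y into two views (illustrative sketch).

    chol is the lower-triangular Cholesky factor of the ASSUMED known
    covariance Sigma; alpha > 0 controls how much noise is added.
    """
    n = len(y)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # omega = chol @ z has covariance Sigma
    omega = [sum(chol[i][j] * z[j] for j in range(i + 1)) for i in range(n)]
    root = math.sqrt(alpha)
    y_train = [y[i] + root * omega[i] for i in range(n)]  # cov (1 + alpha) * Sigma
    y_test = [y[i] - omega[i] / root for i in range(n)]   # cov (1 + 1/alpha) * Sigma
    return y_train, y_test


def empirical_cross_cov(mu, chol, alpha, reps, seed=0):
    """Monte Carlo estimate of Cov(y_train, y_test) over fresh draws of Y."""
    rng = random.Random(seed)
    n = len(mu)
    trains, tests = [], []
    for _ in range(reps):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Y ~ N(mu, Sigma), correlated across coordinates
        y = [mu[i] + sum(chol[i][j] * z[j] for j in range(i + 1)) for i in range(n)]
        y_train, y_test = coupled_views(y, chol, alpha, rng)
        trains.append(y_train)
        tests.append(y_test)
    mean_tr = [sum(t[i] for t in trains) / reps for i in range(n)]
    mean_te = [sum(t[j] for t in tests) / reps for j in range(n)]
    return [
        [
            sum((trains[r][i] - mean_tr[i]) * (tests[r][j] - mean_te[j])
                for r in range(reps)) / (reps - 1)
            for j in range(n)
        ]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Toy Sigma = [[1, 0.6], [0.6, 1]]; its Cholesky factor is written by hand.
    chol = [[1.0, 0.0], [0.6, 0.8]]
    cross = empirical_cross_cov([1.0, -2.0], chol, alpha=0.5, reps=20000)
    print(cross)  # every entry should be close to zero
```

In this sketch the training view carries covariance $(1+\alpha)\Sigma$ rather than $\Sigma$, which is one way to read the "noise-elevated" error: a model fit to the training view sees slightly noisier data than the original, and the extra noise shrinks as $\alpha \to 0$.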
