We introduce a new cross-validation method based on an equicorrelated Gaussian randomization scheme. Our method is well-suited for problems where sample splitting is infeasible, either because the data violate the assumption of independent and identically distributed samples, or because there are insufficient samples to form representative train-test data pairs. In such problems, our method provides a simple, principled, and computationally efficient approach to estimating prediction error, often outperforming standard cross-validation while requiring only a small number of repetitions. Drawing inspiration from recent splitting techniques like data fission and data thinning, our method constructs train-test data pairs using Gaussian randomization. Our main contribution is the introduction of an antithetic Gaussian randomization scheme, involving a carefully designed correlation structure among the randomization variables. We show theoretically that this antithetic construction can eliminate the bias of cross-validation for a broad class of smooth prediction functions, without inflating variance. Through simulations across a range of data types and loss functions, we demonstrate that our estimator outperforms existing methods for prediction error estimation.
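Since the abstract only outlines the construction, the following is a hedged illustration rather than the authors' exact algorithm. It assumes a homoscedastic Gaussian response Y ~ N(theta, sigma^2 I) with known noise level, a data-fission-style train/test decomposition (add noise to fit, subtract rescaled noise to test), and K randomization vectors that are equicorrelated with pairwise correlation -1/(K-1) so they sum to zero, which is the antithetic property the abstract highlights. The function names (`antithetic_gaussian_cv`, `fit`, `loss`) and the scale parameter `alpha` are hypothetical placeholders.

```python
import numpy as np

def antithetic_gaussian_cv(Y, sigma, fit, loss, K=5, alpha=1.0, seed=None):
    """Sketch of prediction-error estimation via antithetic Gaussian
    randomization (illustrative assumptions; not the paper's exact recipe).

    Y     : (n,) response, assumed ~ N(theta, sigma^2 I)
    sigma : known noise standard deviation of Y
    fit   : fit(train) -> (n,) predictions; the rule being assessed
    loss  : loss(test, pred) -> scalar, e.g. mean squared error
    alpha : randomization scale trading train noise for test noise
    """
    rng = np.random.default_rng(seed)
    n = len(Y)

    # Draw K randomization vectors, each marginally N(0, sigma^2 I_n),
    # equicorrelated with pairwise correlation -1/(K-1).  Centering K
    # i.i.d. draws forces omega_1 + ... + omega_K = 0 (the antithetic
    # property); rescaling restores the marginal variance.
    Z = rng.normal(size=(K, n))
    Z -= Z.mean(axis=0)
    omega = sigma * np.sqrt(K / (K - 1)) * Z

    # Each randomization gives a fission-style train/test pair:
    # Y + sqrt(alpha)*omega_k and Y - omega_k/sqrt(alpha) are
    # independent Gaussians when omega_k ~ N(0, sigma^2 I).
    errors = [
        loss(Y - omega[k] / np.sqrt(alpha), fit(Y + np.sqrt(alpha) * omega[k]))
        for k in range(K)
    ]
    return float(np.mean(errors))
```

For instance, with `fit = lambda t: np.full_like(t, t.mean())` and `loss = lambda y, p: np.mean((y - p) ** 2)`, the routine averages K randomized train-test evaluations; the zero-sum correlation structure among the omega_k is what the abstract credits with removing the randomization bias for smooth prediction functions without inflating variance.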