We introduce a method for performing cross-validation without sample splitting. The method is well-suited for problems where traditional sample splitting is infeasible, such as when data are not assumed to be independent and identically distributed. Even in scenarios where sample splitting is possible, our method offers a computationally efficient alternative for estimating prediction error, achieving comparable or even lower error than standard cross-validation at a significantly reduced computational cost. Our approach constructs train-test data pairs using externally generated Gaussian randomization variables, drawing inspiration from recent randomization techniques such as data fission and data thinning. The key innovation lies in a carefully designed correlation structure among these randomization variables, referred to as antithetic Gaussian randomization. This correlation is crucial in maintaining a bounded variance while allowing the bias to vanish, offering an additional advantage over standard cross-validation, whose performance depends heavily on the bias-variance tradeoff dictated by the number of folds. We provide a theoretical analysis of the mean squared error of the proposed estimator, proving that as the level of randomization decreases to zero, the bias converges to zero, while the variance remains bounded and decays linearly with the number of repetitions. This analysis highlights the benefits of the antithetic Gaussian randomization over independent randomization. Simulation studies corroborate our theoretical findings, illustrating the robust performance of our cross-validated estimator across various data types and loss functions.
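To make the construction of train-test pairs from correlated Gaussian randomization variables concrete, the following is a minimal sketch and not the authors' implementation. It assumes a response vector `y` with known noise level `sigma`, and it realizes the antithetic correlation by centering independent Gaussian draws so that the K noise vectors sum to zero (pairwise correlation -1/(K-1)). The train-test pairing `y + sqrt(alpha) * omega` and `y - omega / sqrt(alpha)` follows the usual data-fission recipe for Gaussian data. The function names and the parameters `K`, `alpha`, and `sigma` are hypothetical, chosen only for illustration.

```python
import numpy as np


def antithetic_gaussian_noise(K, n, sigma=1.0, rng=None):
    """Draw K noise vectors in R^n (illustrative helper, not from the paper).

    Each row is marginally N(0, sigma^2 I_n); any two distinct rows have
    correlation -1/(K-1), which forces the K rows to sum to zero.
    """
    if K < 2:
        raise ValueError("K must be at least 2")
    rng = np.random.default_rng(rng)
    z = rng.normal(scale=sigma, size=(K, n))
    # Centering across the K draws induces the negative correlation;
    # the rescaling restores the marginal variance sigma^2.
    centered = z - z.mean(axis=0, keepdims=True)
    return centered * np.sqrt(K / (K - 1))


def train_test_pairs(y, omega, alpha):
    """Build K train-test pairs from y and the noise rows in omega.

    Assumes y has Gaussian noise with variance sigma^2 matching omega,
    in which case each train row is uncorrelated with its test row.
    Smaller alpha means less randomization of the training copy.
    """
    y = np.asarray(y, dtype=float)
    train = y + np.sqrt(alpha) * omega
    test = y - omega / np.sqrt(alpha)
    return train, test


rng = np.random.default_rng(0)
y = rng.normal(size=50)  # synthetic data, for illustration only
omega = antithetic_gaussian_noise(K=4, n=50, sigma=1.0, rng=rng)
train, test = train_test_pairs(y, omega, alpha=0.1)
```

Because the noise vectors sum to zero, averaging the K training copies (or the K test copies) recovers `y` exactly. This gives some intuition for why the antithetic design can keep the variance bounded as `alpha` shrinks, whereas independent draws would not cancel in this way.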