Cross-validation is a common method for estimating the predictive performance of machine learning models. In a data-scarce regime, where one typically wishes to maximize the number of instances used for training the model, an approach called "leave-one-out cross-validation" is often used. In this design, a separate model is built to predict each data instance after training on all the other instances. Since this leaves only a single test instance per trained model, predictions are aggregated across the entire dataset to calculate common performance metrics such as the area under the receiver operating characteristic curve (AUROC) or the R² score. In this work, we demonstrate that this approach creates a negative correlation between the average label of each training fold and the label of its corresponding test instance, a phenomenon that we term distributional bias. Because machine learning models tend to regress to the mean of their training data, this distributional bias tends to degrade performance evaluation and hyperparameter optimization. We show that this effect generalizes to leave-P-out cross-validation, persists across a wide range of modeling and evaluation approaches, and can introduce a bias against stronger regularization. To address this, we propose a generalizable rebalanced cross-validation approach that corrects for distributional bias in both classification and regression. We demonstrate that our approach improves cross-validation performance evaluation in synthetic simulations, across machine learning benchmarks, and in several published leave-one-out analyses.
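A brief sketch of why the negative correlation arises (our notation, not taken from the abstract: $n$ labels $y_1,\dots,y_n$ with overall mean $\bar{y}$): when instance $i$ is held out, the mean label of its training fold is

$$\bar{y}_{-i} = \frac{n\bar{y} - y_i}{n - 1},$$

an affine, strictly decreasing function of the held-out label $y_i$, so under the leave-one-out design the training-fold mean and the test label are perfectly anticorrelated.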