Cross-validation is a common method for estimating the predictive performance of machine learning models. In a data-scarce regime, where one typically wishes to maximize the number of instances used for training the model, an approach called "leave-one-out cross-validation" is often used. In this design, a separate model is built for predicting each data instance after training on all other instances. Since this results in a single test data point available per model trained, predictions are aggregated across the entire dataset to calculate common rank-based performance metrics such as the area under the receiver operating characteristic or precision-recall curves. In this work, we demonstrate that this approach creates a negative correlation between the average label of each training fold and the label of its corresponding test instance, a phenomenon that we term distributional bias. As machine learning models tend to regress to the mean of their training data, this distributional bias tends to negatively impact performance evaluation and hyperparameter optimization. We show that this effect generalizes to leave-P-out cross-validation and persists across a wide range of modeling and evaluation approaches, and that it can lead to a bias against stronger regularization. To address this, we propose a generalizable rebalanced cross-validation approach that corrects for distributional bias. We demonstrate that our approach improves cross-validation performance evaluation in synthetic simulations and in several published leave-one-out analyses.
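The negative correlation described above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes binary labels drawn at random, and the helper names `loo_train_means` and `auroc` are hypothetical and introduced here for illustration only. Under leave-one-out, the mean label of each training fold is a decreasing linear function of the held-out label, so a "dummy" predictor that outputs only the training-fold mean ranks every positive instance below every negative one once predictions are pooled across folds.

```python
import numpy as np


def loo_train_means(y):
    """Mean label of each leave-one-out training fold (hypothetical helper).

    Fold i contains every label except y[i], so its mean is
    (sum(y) - y[i]) / (n - 1).
    """
    y = np.asarray(y, dtype=float)
    return (y.sum() - y) / (y.size - 1)


def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Illustrative O(n_pos * n_neg) version.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    correct = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(correct / (pos.size * neg.size))


# Assumption: a small synthetic dataset with random binary labels.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=50)
y[0], y[1] = 0, 1  # make sure both classes are present

train_means = loo_train_means(y)

# Correlation between each training-fold mean and its held-out label.
corr = np.corrcoef(train_means, y)[0, 1]

# A dummy model that predicts the training-fold mean, with predictions
# pooled across all folds before computing the rank-based metric.
dummy_auroc = auroc(y, train_means)

print(f"correlation(train mean, test label) = {corr:.3f}")
print(f"pooled AUROC of mean-predicting dummy = {dummy_auroc:.3f}")
```

In this sketch the dummy predictor carries no information about the features, yet its pooled score falls well below the chance level of 0.5. This is one way to see why models that regress toward their training mean can look worse than they are under leave-one-out evaluation.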