Traditionally, machine learning-based clinical prediction models have been trained and evaluated on patient data from a single source, such as a hospital. Cross-validation methods can be used to estimate the accuracy of such models on new patients originating from the same source, by repeated random splitting of the data. However, such estimates tend to be highly overoptimistic when compared to the accuracy obtained when models are deployed to sources not represented in the dataset, such as a new hospital. The increasing availability of multi-source medical datasets provides new opportunities for obtaining more comprehensive and realistic evaluations of expected accuracy through source-level cross-validation designs. In this study, we present a systematic empirical evaluation of standard K-fold cross-validation and leave-source-out cross-validation methods in a multi-source setting. We consider the task of electrocardiogram-based cardiovascular disease classification, combining and harmonizing the openly available PhysioNet CinC Challenge 2021 and the Shandong Provincial Hospital datasets for our study. Our results show that K-fold cross-validation, both on single-source and multi-source data, systematically overestimates prediction performance when the end goal is to generalize to new sources. Leave-source-out cross-validation provides more reliable performance estimates, with close to zero bias but larger variability. The evaluation highlights the danger of obtaining misleading cross-validation results on medical data and demonstrates how this issue can be mitigated when multi-source data are available.
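The difference between the two splitting schemes compared above can be illustrated with a minimal sketch. This is not the study's actual pipeline: the function names `k_fold_splits` and `leave_source_out_splits`, the toy source labels, and the fold count are all assumptions made for illustration, and only the Python standard library is used.

```python
import random
from collections import defaultdict


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs from a random K-fold partition.

    Samples are shuffled without regard to their source, so every
    source can appear in both the training and the test fold.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train, test


def leave_source_out_splits(sources):
    """Yield (held_out_source, train_idx, test_idx), one source at a time.

    All samples from the held-out source go to the test set, so the
    model is always evaluated on a source it has never seen.
    """
    by_source = defaultdict(list)
    for idx, src in enumerate(sources):
        by_source[src].append(idx)
    for held_out in sorted(by_source):
        test = by_source[held_out]
        train = [i for i, s in enumerate(sources) if s != held_out]
        yield held_out, train, test


# Toy data: nine recordings from three hypothetical sources (assumed labels).
sources = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
kfold = list(k_fold_splits(len(sources), k=3))
lso = list(leave_source_out_splits(sources))
```

In the K-fold variant the test fold is drawn from the same sources as the training data, which is the setting the abstract reports as overestimating performance on new sources. In the leave-source-out variant each test set consists of one entire unseen source, at the cost of having only as many folds as there are sources.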