Class imbalance is common in the development of clinical prediction models (CPMs) and is often assumed to degrade predictive performance. Several methods have been proposed to correct imbalance during CPM development, but it remains unclear whether such correction improves or harms CPM performance. This study investigated how imbalance correction affects classification performance and prediction stability. We simulated the development and internal validation of CPMs using penalised logistic regression under different imbalance-correction strategies: algorithm-level rebalancing, data-level rebalancing by oversampling, and combined over- and under-sampling. The simulation dataset was derived from the GUSTO-I trial, which included 40,830 patients and 2,851 events. All strategies were evaluated across sample-size scenarios ranging from 500 to 40,830. Model performance and prediction stability were assessed over 200 bootstrap resamples in terms of discrimination, calibration, calibration stability, mean absolute prediction error (MAPE), and the classification instability index (CII). Class imbalance correction did not meaningfully improve model discrimination. Compared with models developed without correction, both data-level and algorithm-level correction led to miscalibration, risk overestimation, and increased prediction instability, as shown by the prediction stability, MAPE, and CII plots. These findings suggest that class imbalance correction does not necessarily improve CPM performance and may compromise calibration and prediction stability. Class imbalance should not be treated as a pathology that automatically requires correction. In clinical prediction modelling, routine imbalance correction is generally not advisable.
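The miscalibration described above can be illustrated with a minimal sketch. It does not reproduce the study's simulation. The data are synthetic (not GUSTO-I), the predictors and coefficients are invented for illustration, and the helper names `fit_ridge_logistic`, `predict_risk`, `auc`, and `run_demo` are hypothetical. Algorithm-level rebalancing is approximated here by weighting events so that both classes carry equal total weight in a lightly ridge-penalised logistic regression. The sketch compares mean predicted risk with the observed event rate, and discrimination, with and without that weighting.

```python
import numpy as np


def fit_ridge_logistic(X, y, sample_weight=None, lam=1.0, n_iter=50):
    """Weighted ridge-penalised logistic regression fitted by Newton-Raphson.

    The intercept (first coefficient) is left unpenalised.
    """
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])
    w = np.ones(n) if sample_weight is None else np.asarray(sample_weight, dtype=float)
    penalty = lam * np.eye(p + 1)
    penalty[0, 0] = 0.0
    beta = np.zeros(p + 1)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(Xd @ beta)))
        grad = Xd.T @ (w * (y - prob)) - penalty @ beta
        hess = (Xd * (w * prob * (1.0 - prob))[:, None]).T @ Xd + penalty
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-8:
            break
    return beta


def predict_risk(beta, X):
    """Predicted event probability for each row of X."""
    return 1.0 / (1.0 + np.exp(-(beta[0] + X @ beta[1:])))


def auc(y, score):
    """C-statistic via the rank-sum formula (assumes no tied scores)."""
    order = np.argsort(score)
    ranks = np.empty(len(score))
    ranks[order] = np.arange(1, len(score) + 1)
    n1 = int(y.sum())
    n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2.0) / (n1 * n0)


def simulate(rng, n):
    """Synthetic imbalanced data; the coefficients are illustrative assumptions."""
    X = rng.normal(size=(n, 3))
    lin = -3.0 + X @ np.array([0.8, 0.5, -0.4])
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-lin)))
    return X, y


def run_demo(seed=0, n=20000):
    rng = np.random.default_rng(seed)
    X_dev, y_dev = simulate(rng, n)
    X_val, y_val = simulate(rng, n)

    # No correction.
    beta_plain = fit_ridge_logistic(X_dev, y_dev)

    # Class weighting: events are up-weighted so both classes have equal total weight.
    n1 = y_dev.sum()
    n0 = len(y_dev) - n1
    weights = np.where(y_dev == 1, n0 / n1, 1.0)
    beta_weighted = fit_ridge_logistic(X_dev, y_dev, sample_weight=weights)

    risk_plain = predict_risk(beta_plain, X_val)
    risk_weighted = predict_risk(beta_weighted, X_val)
    return {
        "event_rate": float(y_val.mean()),
        "mean_risk_plain": float(risk_plain.mean()),
        "mean_risk_weighted": float(risk_weighted.mean()),
        "auc_plain": float(auc(y_val, risk_plain)),
        "auc_weighted": float(auc(y_val, risk_weighted)),
    }


results = run_demo()
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

In this toy setting the weighted model ranks patients about as well as the uncorrected model, while its average predicted risk sits far above the observed event rate. This matches the direction of the pattern reported above, although the study's actual strategies, penalty tuning, and stability measures (MAPE, CII) are not implemented here.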