Class imbalance remains a practical obstacle in the development of clinical prediction models for conditions such as diabetes mellitus, where confirmed cases are often far outnumbered by controls. The Synthetic Minority Over-sampling Technique (SMOTE) and its variants are widely used to address this imbalance, but they generate synthetic observations through local interpolation in feature space and do not explicitly model the joint dependence structure of the minority class. To address this limitation, we introduce CopulaSMOTE, a copula-based data augmentation approach that estimates the minority-class dependence structure when generating synthetic samples and integrates with standard machine learning techniques. Specifically, we employ truncated vine copulas to represent multivariate dependence through a sequence of bivariate building blocks. We evaluate the proposed approach on three public diabetes datasets, namely the Pima Indians Diabetes dataset, the Iraqi Diabetes dataset, and the CDC BRFSS 2015 Diabetes Health Indicators dataset, which together cover a range of sample sizes, dimensionalities, and imbalance regimes. For each dataset, five resampling strategies are compared across five classifiers using a 5×2 cross-validation protocol with Dietterich's paired t-test. Our findings suggest that CopulaSMOTE can improve minority-class recovery in larger tabular diabetes datasets, particularly the CDC BRFSS dataset, but its advantages depend on the classifier and evaluation metric.
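The 5×2 cross-validation comparison mentioned above can be illustrated with a minimal sketch of Dietterich's paired t-test. This is not the study's code: the function names `five_by_two_cv_t` and `two_sided_p_df5` and the score differences in the usage lines are hypothetical, chosen only to show how the statistic is formed. Each of five replications contributes two fold-wise differences in a metric between two resampling strategies, and the statistic is referred to a t distribution with 5 degrees of freedom.

```python
import math


def five_by_two_cv_t(diffs):
    """Dietterich's 5x2cv paired t statistic.

    diffs: five (fold1, fold2) pairs, each the difference in a score
    (e.g. error rate) between method A and method B on one fold.
    """
    if len(diffs) != 5 or any(len(d) != 2 for d in diffs):
        raise ValueError("expected five replications of two folds each")
    variances = []
    for d1, d2 in diffs:
        mean = (d1 + d2) / 2.0
        # Per-replication variance estimate from the two fold differences.
        variances.append((d1 - mean) ** 2 + (d2 - mean) ** 2)
    denom = math.sqrt(sum(variances) / 5.0)
    if denom == 0.0:
        raise ValueError("all fold differences are identical; t is undefined")
    # The numerator is the difference from the first fold of the first replication.
    return diffs[0][0] / denom


def two_sided_p_df5(t):
    """Two-sided p-value for a t statistic with 5 degrees of freedom,
    using the closed-form Student t CDF for odd degrees of freedom."""
    theta = math.atan(abs(t) / math.sqrt(5.0))
    c = math.cos(theta)
    central_mass = (2.0 / math.pi) * (theta + math.sin(theta) * (c + (2.0 / 3.0) * c ** 3))
    return 1.0 - central_mass


# Hypothetical fold-wise score differences (illustrative values, not results).
example_diffs = [(0.03, 0.01)] * 5
t_stat = five_by_two_cv_t(example_diffs)
p_value = two_sided_p_df5(t_stat)
print(f"t = {t_stat:.3f}, two-sided p = {p_value:.3f}")
```

In practice the differences would come from fitting each classifier on one half of the data and scoring it on the other, with resampling applied only to the training half so that synthetic samples never leak into the evaluation fold.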