CopulaSMOTE: A Copula-Based Oversampling Approach for Imbalanced Classification in Diabetes Prediction

Class imbalance remains a practical obstacle in developing clinical prediction models for conditions such as diabetes mellitus, where confirmed cases are often far fewer than controls. The Synthetic Minority Over-sampling Technique (SMOTE) and its variants are widely used to address this imbalance, but they generate synthetic observations by local interpolation in feature space and do not explicitly model the joint dependence structure of the minority class. To address this limitation, we introduce CopulaSMOTE, a copula-based data augmentation approach that estimates the minority-class dependence structure when generating synthetic samples and integrates with standard machine learning techniques. Specifically, we employ truncated vine copulas, which represent multivariate dependence through a sequence of bivariate building blocks. We evaluate the approach on three public diabetes datasets: the Pima Indians Diabetes dataset, the Iraqi Diabetes dataset, and the CDC BRFSS 2015 Diabetes Health Indicators dataset, which together cover a range of sample sizes, dimensionalities, and imbalance regimes. For each dataset, five resampling strategies are compared across five classifiers using a 5×2 cross-validation protocol with Dietterich's paired t-test. Our findings suggest that CopulaSMOTE can improve minority-class recovery in larger tabular diabetes datasets, particularly the CDC BRFSS dataset, but that its advantages depend on the classifier and the evaluation metric.
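
As an illustration of the comparison protocol named in the abstract, the sketch below computes the standard 5×2 cross-validation paired t statistic as it is usually stated following Dietterich (1998). It is not taken from the paper: the function name `dietterich_5x2cv_t` and the example metric differences are hypothetical, and the abstract does not say which metric or implementation the authors used. Each input pair holds the difference in a chosen metric between two resampling strategies on the two folds of one replication.

```python
import math

# Two-sided 5% critical value of Student's t with 5 degrees of freedom.
T_CRIT_5DF = 2.571


def dietterich_5x2cv_t(diffs):
    """Return the 5x2cv paired t statistic.

    diffs: five (d1, d2) pairs, one per replication, where d1 and d2 are
    the metric differences (method A minus method B) on fold 1 and fold 2.
    """
    if len(diffs) != 5 or any(len(pair) != 2 for pair in diffs):
        raise ValueError("expected five (fold1, fold2) difference pairs")

    variances = []
    for d1, d2 in diffs:
        mean = (d1 + d2) / 2.0
        # Variance estimate for one replication of 2-fold cross-validation.
        variances.append((d1 - mean) ** 2 + (d2 - mean) ** 2)

    pooled = sum(variances) / 5.0
    if pooled == 0.0:
        raise ValueError("all replications have zero variance")

    # The numerator is the difference from fold 1 of replication 1 only.
    return diffs[0][0] / math.sqrt(pooled)


# Hypothetical differences in some metric, for illustration only.
example_diffs = [(0.02, 0.04), (0.03, 0.01), (0.05, 0.03),
                 (0.02, 0.02), (0.04, 0.02)]
t_stat = dietterich_5x2cv_t(example_diffs)
significant = abs(t_stat) > T_CRIT_5DF
```

Under the null hypothesis of no difference, the statistic is compared against a t distribution with 5 degrees of freedom, so with these made-up numbers the difference would not be declared significant at the 5% level.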

