Imbalanced learning occurs in classification settings where the distribution of class labels is highly skewed in the training data, such as when predicting rare diseases or detecting fraud. This class imbalance presents a significant algorithmic challenge, which can be further exacerbated when privacy-preserving techniques such as differential privacy (DP) are applied to protect sensitive training data. Our work formalizes these challenges and provides a number of algorithmic solutions. We consider DP variants of pre-processing methods that privately augment the original dataset to reduce the class imbalance; these include oversampling, SMOTE, and private synthetic data generation. We also consider DP variants of in-processing techniques, which adjust the learning algorithm to account for the imbalance; these include model bagging, class-weighted empirical risk minimization (ERM), and class-weighted deep learning. For each method, we either adapt an existing imbalanced learning technique to the private setting or demonstrate its incompatibility with differential privacy. Finally, we empirically evaluate these privacy-preserving imbalanced learning methods under various data and distributional settings. We find that private synthetic data methods perform well as a data pre-processing step, while class-weighted ERMs are an alternative in higher-dimensional settings where private synthetic data suffers from the curse of dimensionality.
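To make the in-processing idea concrete, the following is a minimal sketch of class-weighted ERM trained with clipped, noised per-example gradients in the style of DP-SGD. It is illustrative only and not the paper's algorithm: the function name, hyperparameters, and toy data are invented here, and calibrating `noise_scale` to a formal (epsilon, delta) guarantee is omitted.

```python
import math
import random

def class_weighted_dp_sgd(data, labels, epochs=200, lr=0.1,
                          noise_scale=0.0, clip=1.0, seed=0):
    """Logistic regression via class-weighted SGD with per-example
    gradient clipping. Setting noise_scale > 0 adds Gaussian noise to
    each clipped gradient, as in DP-SGD; the privacy accounting needed
    to turn noise_scale into an (epsilon, delta) guarantee is omitted."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    # Inverse-frequency class weights: minority examples count for more.
    pos = sum(labels)
    w_class = {1: n / (2 * pos), 0: n / (2 * (n - pos))}
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        i = rng.randrange(n)
        x, y = data[i], labels[i]
        z = sum(wj * xj for wj, xj in zip(w, x)) + b
        p = 1.0 / (1.0 + math.exp(-z))
        # Per-example logistic-loss gradient, scaled by the class weight.
        g = [w_class[y] * (p - y) * xj for xj in x]
        gb = w_class[y] * (p - y)
        # Clip to bound per-example sensitivity, then add Gaussian noise.
        norm = math.sqrt(sum(gj * gj for gj in g) + gb * gb)
        scale = min(1.0, clip / (norm + 1e-12))
        g = [gj * scale + noise_scale * clip * rng.gauss(0, 1) for gj in g]
        gb = gb * scale + noise_scale * clip * rng.gauss(0, 1)
        w = [wj - lr * gj for wj, gj in zip(w, g)]
        b -= lr * gb
    return w, b

# Toy imbalanced dataset: 1 positive among 10 points in 2-D.
data = [[1.0, 1.0]] + [[-1.0 - 0.1 * k, -1.0] for k in range(9)]
labels = [1] + [0] * 9
w, b = class_weighted_dp_sgd(data, labels)
score = sum(wj * xj for wj, xj in zip(w, [1.0, 1.0])) + b
print(score > 0)  # the minority point should be classified positive
```

The class weights up-weight the rare class so the minority gradient is not drowned out, while the clipping step is what bounds each example's influence and makes the Gaussian noise meaningful for privacy.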