Imbalanced data, in which positive samples make up only a small proportion relative to negative samples, makes it challenging for classifiers to balance the false positive and false negative rates. A common approach to this challenge is to generate synthetic data for the minority group and then train classification models on both the observed and synthetic data. However, because the synthetic data depends on the observed data and fails to replicate the original data distribution accurately, prediction accuracy suffers when the synthetic data is naïvely treated as true data. In this paper, we characterize the bias introduced by synthetic data and provide consistent estimators of this bias by borrowing information from the majority group. We propose a bias correction procedure that mitigates the adverse effects of synthetic data, enhancing prediction accuracy while avoiding overfitting. The procedure extends to broader scenarios with imbalanced data, such as imbalanced multi-task learning and causal inference. Theoretical properties, including bounds on the bias estimation error and improvements in prediction accuracy, are provided. Simulation results and data analysis on handwritten digit datasets demonstrate the effectiveness of our method.
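To make the "generate synthetic data for the minority group" step concrete, the following is a minimal sketch of SMOTE-style minority oversampling, one common way such synthetic data is produced. The function name, parameters, and interpolation scheme are illustrative assumptions for exposition; they are not the paper's proposed procedure, which instead corrects the bias that such synthetic data introduces.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, rng=None):
    """Illustrative SMOTE-style oversampling (assumed helper, not the paper's method).

    Each synthetic point lies on the segment between a random minority
    sample and one of its k nearest minority-class neighbors.
    """
    rng = np.random.default_rng(rng)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude self from neighbors
    nbrs = np.argsort(d, axis=1)[:, :k]    # k nearest neighbors per point
    base = rng.integers(0, n, size=n_new)  # random minority anchors
    nbr = nbrs[base, rng.integers(0, k, size=n_new)]  # one random neighbor each
    lam = rng.random((n_new, 1))           # interpolation weights in [0, 1]
    return X_min[base] + lam * (X_min[nbr] - X_min[base])
```

Because every synthetic point is a convex combination of two observed minority samples, the synthetic set inherits any distributional distortion in the observed minority data; this dependence is the source of the bias the paper's correction procedure targets.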