Data rebalancing techniques, including oversampling and undersampling, are a common approach to addressing the challenges of imbalanced data. To tackle unresolved problems related to both oversampling and undersampling, we propose a new undersampling approach that: (i) avoids the pitfalls of noise and overlap caused by synthetic data and (ii) avoids the pitfall of under-fitting caused by random undersampling. Instead of undersampling majority data randomly, our method undersamples datapoints based on their ability to improve model loss. Using improved model loss as a proxy measurement for classification performance, our technique assesses a datapoint's impact on loss and rejects those unable to improve it. In so doing, our approach rejects majority datapoints that are redundant to datapoints already accepted and thereby finds an optimal subset of majority training data for classification. The accept/reject component of our algorithm is motivated by a bilevel optimization problem uniquely formulated to identify the optimal training set we seek. Experimental results show that our proposed technique achieves F1 scores up to 10% higher than those of state-of-the-art methods.
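The accept/reject idea above can be illustrated with a minimal greedy sketch: each majority datapoint is tentatively added to the accepted set and kept only if it lowers a held-out loss. This is an illustrative approximation, not the paper's bilevel formulation; the synthetic data, the logistic-regression model, and the greedy single-pass acceptance rule are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Illustrative sketch only: a greedy stand-in for the paper's
# bilevel-optimization-motivated accept/reject component.
rng = np.random.default_rng(0)

# Synthetic imbalanced data: 200 majority (class 0), 20 minority (class 1).
X_maj = rng.normal(0.0, 1.0, size=(200, 2))
X_min = rng.normal(2.0, 1.0, size=(20, 2))
X_val = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),
                   rng.normal(2.0, 1.0, size=(20, 2))])
y_val = np.array([0] * 20 + [1] * 20)


def val_loss(X_train, y_train):
    """Fit a classifier and return its loss on the held-out set,
    used here as the proxy measurement for classification performance."""
    clf = LogisticRegression().fit(X_train, y_train)
    return log_loss(y_val, clf.predict_proba(X_val), labels=[0, 1])


# Start with all minority points plus a small seed of majority points.
accepted = [0, 1]  # indices into X_maj
best = val_loss(np.vstack([X_maj[accepted], X_min]),
                np.array([0] * len(accepted) + [1] * len(X_min)))

for i in range(2, len(X_maj)):
    trial = accepted + [i]
    loss = val_loss(np.vstack([X_maj[trial], X_min]),
                    np.array([0] * len(trial) + [1] * len(X_min)))
    if loss < best:  # accept only datapoints that improve the loss
        accepted, best = trial, loss
    # otherwise reject: the point is redundant given those already accepted

print(f"kept {len(accepted)} of {len(X_maj)} majority points")
```

The accepted indices form the undersampled majority set; datapoints whose inclusion fails to improve the held-out loss are treated as redundant and dropped.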