When the competing classes in a classification problem are not of comparable size, many popular classifiers exhibit a bias towards the larger classes, and the nearest neighbor classifier is no exception. To address this problem, we develop a statistical method for nearest neighbor classification on such imbalanced data sets. We first construct a classifier for the binary classification problem and then extend it to problems involving more than two classes. Unlike existing oversampling and undersampling methods, our proposed classifiers neither generate pseudo-observations nor remove existing observations, so their results are exactly reproducible. We establish the Bayes risk consistency of these classifiers under appropriate regularity conditions. Analyses of several benchmark data sets demonstrate their superior performance over existing methods.