Class imbalance is a prevalent issue in real-world machine learning applications, often leading to poor performance on rare and minority classes. With an abundance of wild unlabeled data, active learning is perhaps the most effective technique for solving the problem at its root: collecting a more balanced and informative set of labeled examples during annotation. In this work, we propose a novel algorithm that first identifies the class separation threshold and then annotates the most uncertain examples from the minority classes, close to the separation threshold. Through a novel reduction to one-dimensional active learning, our algorithm DIRECT is able to leverage the classic active learning literature to address issues such as batch labeling and tolerance towards label noise. Our algorithm saves more than 15\% of the annotation budget compared to the state-of-the-art active learning algorithm and more than 90\% of the annotation budget compared to random sampling.
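To illustrate the one-dimensional view described in the abstract, the following is a minimal sketch and not the DIRECT algorithm itself. It assumes a noiseless, perfectly separable setting in which every example has a scalar model score and all examples above some unknown threshold belong to the class of interest. Under that assumption, a binary search over the sorted scores locates the threshold with a logarithmic number of label queries, and the remaining budget is spent on the examples nearest to it. The names `select_near_threshold`, `oracle`, and `budget` are hypothetical, and the sketch does not handle batch labeling or label noise.

```python
import numpy as np


def select_near_threshold(scores, oracle, budget):
    """Illustrative 1-D threshold search (assumes noiseless, separable labels).

    scores: 1-D array of scalar model scores for the unlabeled pool.
    oracle: hypothetical callable mapping a pool index to a 0/1 label.
    budget: maximum number of label queries.
    Returns (labels, boundary), where labels maps pool index -> label and
    boundary is the estimated first sorted rank with label 1.
    """
    order = np.argsort(scores)  # pool indices in ascending score order
    labels = {}

    def query(rank):
        idx = int(order[rank])
        if idx not in labels:  # never pay twice for the same example
            labels[idx] = oracle(idx)
        return labels[idx]

    # Step 1: binary search for the first rank labeled 1.
    # If the budget runs out early, `lo` is only an approximation.
    lo, hi = 0, len(order)
    while lo < hi and len(labels) < budget:
        mid = (lo + hi) // 2
        if query(mid) == 1:
            hi = mid
        else:
            lo = mid + 1
    boundary = lo

    # Step 2: spend the remaining budget on the ranks closest to the
    # boundary, which sits between rank boundary - 1 and rank boundary.
    by_distance = sorted(range(len(order)), key=lambda r: abs(r - boundary + 0.5))
    for rank in by_distance:
        if len(labels) >= budget:
            break
        query(rank)

    return labels, boundary
```

In an imbalanced pool, the queries in step 2 straddle the threshold, so roughly half of them fall on the minority side, whereas uniform random sampling would return minority examples only at their base rate.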