Most popular classification algorithms are designed to maximize classification accuracy during training. This strategy can fail under class imbalance, because a model can reach high accuracy simply by overfitting to the majority class. The Area Under the Curve (AUC), by contrast, is widely used to compare the classification performance of different algorithms on imbalanced data, and various approaches that optimize this metric directly during training have been proposed. Among them, SVM-based formulations are especially popular because they make it easy to incorporate different regularization strategies. In this work, we develop a prototype learning approach that relies on a cutting-plane method, similar to Ranking SVM, to maximize AUC. Because the algorithm introduces cutting planes iteratively, it learns simpler models, which prevents overfitting in an unconventional way. Furthermore, it penalizes changes in the weights at each iteration to avoid large jumps that might otherwise be observed in test performance, thus facilitating a smooth learning process. In experiments on 73 binary classification datasets, our method yields the best test AUC on 25 datasets among its relevant competitors.
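To make the contrast between accuracy and AUC concrete, the following minimal sketch uses a toy dataset with 95 negatives and 5 positives. It is an illustration only and not the cutting-plane algorithm described above: the function names (`accuracy`, `pairwise_auc`, `pairwise_hinge_loss`), the toy data, and the margin of 1.0 are all assumptions introduced here. The sketch computes AUC as the fraction of (positive, negative) pairs that are ranked correctly, and shows the pairwise hinge loss that ranking-style SVM formulations commonly use as a convex surrogate for that quantity.

```python
def accuracy(labels, preds):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(int(y == p) for y, p in zip(labels, preds)) / len(labels)


def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


def pairwise_hinge_loss(labels, scores, margin=1.0):
    """Mean hinge loss over (positive, negative) pairs.

    Illustrative convex surrogate for 1 - AUC; the margin value is an assumption.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    total = sum(max(0.0, margin - (p - n)) for p in pos for n in neg)
    return total / (len(pos) * len(neg))


# Toy imbalanced dataset (hypothetical): 95 negatives followed by 5 positives.
labels = [0] * 95 + [1] * 5

# A model that ignores the input and always favors the majority class.
constant_scores = [0.0] * 100
constant_preds = [0] * 100

# A model that scores every positive above every negative.
perfect_scores = [0.0] * 95 + [1.0] * 5

majority_accuracy = accuracy(labels, constant_preds)
majority_auc = pairwise_auc(labels, constant_scores)
perfect_auc = pairwise_auc(labels, perfect_scores)
majority_loss = pairwise_hinge_loss(labels, constant_scores)
perfect_loss = pairwise_hinge_loss(labels, perfect_scores)

print(majority_accuracy, majority_auc, perfect_auc, majority_loss, perfect_loss)
```

In this toy setting the majority-class model looks strong by accuracy while its AUC is no better than chance, which is the failure mode the abstract describes. The nested loops are quadratic in the number of examples and are kept only for readability.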