The Area Under the ROC Curve (AUC) is an important metric for evaluating binary classifiers, and many algorithms have been proposed to optimize AUC approximately. This raises the question of whether the generally insignificant gains observed in previous studies are due to inherent limitations of the metric or to the inadequate quality of the optimization. To better understand the value of optimizing for AUC, we present an efficient algorithm, AUC-opt, that finds the provably AUC-optimal linear classifier in $\mathbb{R}^2$ and runs in $\mathcal{O}(n_+ n_- \log (n_+ n_-))$ time, where $n_+$ and $n_-$ are the numbers of positive and negative samples, respectively. Furthermore, it extends naturally to $\mathbb{R}^d$, running in $\mathcal{O}((n_+n_-)^{d-1}\log (n_+n_-))$ time by calling AUC-opt recursively in lower-dimensional spaces. We prove that the problem is NP-complete when $d$ is not fixed, by reduction from the \textit{open hemisphere problem}. Experiments show that, compared with other methods, AUC-opt achieves statistically significant improvements on between 17 and 40 of 50 t-SNE training datasets in $\mathbb{R}^2$ and on between 4 and 42 of them in $\mathbb{R}^3$. However, the gain generally proves insignificant on most testing datasets compared to the best standard classifiers. Similar observations hold for nonlinear AUC methods on real-world datasets.
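To make the $\mathbb{R}^2$ setting concrete, the following is a minimal brute-force sketch, not the AUC-opt algorithm itself. It relies on the observation that the AUC of a linear score $w^\top x$ changes only when $w$ becomes orthogonal to some difference $x_+ - x_-$, so at most $2 n_+ n_-$ critical angles need to be considered. The function names `linear_auc` and `brute_force_best_direction_2d` are hypothetical, ties are assumed to count as half a correctly ranked pair, and each candidate direction is re-evaluated from scratch, so the sketch takes $\mathcal{O}((n_+ n_-)^2)$ time rather than the $\mathcal{O}(n_+ n_- \log (n_+ n_-))$ stated above.

```python
import numpy as np


def linear_auc(w, pos, neg):
    """AUC of the score w.x: fraction of (positive, negative) pairs ranked
    correctly, with ties counted as half (an assumed convention)."""
    margins = (pos @ w)[:, None] - (neg @ w)[None, :]
    return float(np.mean(margins > 0) + 0.5 * np.mean(margins == 0))


def brute_force_best_direction_2d(pos, neg):
    """Search all cells between critical angles for the best direction in R^2.

    Illustrative only: quadratic in n_+ * n_-, unlike the sweep-based bound
    quoted in the abstract.
    """
    # All pairwise differences x_+ - x_-; zero differences never change rank.
    diffs = (pos[:, None, :] - neg[None, :, :]).reshape(-1, 2)
    diffs = diffs[np.any(diffs != 0, axis=1)]
    if diffs.shape[0] == 0:
        return np.array([1.0, 0.0]), 0.5  # every pair is tied

    # w is orthogonal to a difference at angle theta when w points at theta +/- pi/2.
    theta = np.arctan2(diffs[:, 1], diffs[:, 0])
    critical = np.unique(
        np.concatenate([theta + np.pi / 2, theta - np.pi / 2]) % (2 * np.pi)
    )

    # The AUC is constant strictly between consecutive critical angles,
    # so one midpoint per cell (with wrap-around) is enough.
    following = np.append(critical[1:], critical[0] + 2 * np.pi)
    candidates = (critical + following) / 2

    best_w, best_auc = None, -1.0
    for phi in candidates:
        w = np.array([np.cos(phi), np.sin(phi)])
        auc = linear_auc(w, pos, neg)
        if auc > best_auc:
            best_w, best_auc = w, auc
    return best_w, best_auc
```

On linearly separable inputs the search returns a direction with AUC 1, while on an XOR-like layout every direction ranks exactly half of the pairs correctly. Replacing the per-candidate re-evaluation with a sorted sweep over the critical angles, updating the count of correctly ranked pairs incrementally, is one plausible way to approach the stated log-linear bound.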