In machine learning (ML), a widespread adage holds that, for binary classification tasks with class imbalance, the area under the precision-recall curve (AUPRC) is a better metric for model comparison than the area under the receiver operating characteristic curve (AUROC). This paper challenges that notion through novel mathematical analysis, showing that AUROC and AUPRC can be concisely related in probabilistic terms. We demonstrate that AUPRC, contrary to popular belief, is not superior under class imbalance and may even be a harmful metric, because it unduly favors model improvements in subpopulations with more frequent positive labels. This bias can inadvertently heighten algorithmic disparities. Prompted by these insights, we conducted a thorough review of the existing ML literature, using large language models to analyze over 1.5 million papers from arXiv. Our investigation focused on how prevalent the claim of AUPRC superiority is and how well it is substantiated. The results expose a significant deficit in empirical backing and a trend of misattributions that have fuelled the widespread acceptance of AUPRC's supposed advantages. Our findings represent a dual contribution: a significant technical advancement in understanding metric behaviors and a stark warning about unchecked assumptions in the ML community. All experiments are accessible at https://github.com/mmcdermott/AUC_is_all_you_need.
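As a rough intuition for why the two metrics can reward different improvements, the following minimal sketch compares how each responds to fixing a single ranking mistake. It is an illustrative toy and not one of the paper's experiments: the six-sample ranking, the helper names `auroc` and `auprc`, and the use of average precision as the AUPRC estimator are all assumptions made here. The link to subpopulations is only suggestive: if a group with more frequent positive labels tends to receive higher scores, its mistakes sit nearer the top of the ranking.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auprc(scores, labels):
    """Average precision: mean of the precision measured at each positive's rank.

    Assumption: average precision stands in for AUPRC; scores have no ties.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical scores, already sorted from highest to lowest.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]

# Baseline labels contain two adjacent mistakes (a negative ranked just above
# a positive): one at the top (ranks 1-2) and one at the bottom (ranks 5-6).
base = [0, 1, 1, 0, 0, 1]
fix_top = [1, 0, 1, 0, 0, 1]     # top mistake corrected
fix_bottom = [0, 1, 1, 0, 1, 0]  # bottom mistake corrected

auroc_gain_top = auroc(scores, fix_top) - auroc(scores, base)
auroc_gain_bottom = auroc(scores, fix_bottom) - auroc(scores, base)
auprc_gain_top = auprc(scores, fix_top) - auprc(scores, base)
auprc_gain_bottom = auprc(scores, fix_bottom) - auprc(scores, base)

print(f"AUROC gain: top={auroc_gain_top:.4f}, bottom={auroc_gain_bottom:.4f}")
print(f"AUPRC gain: top={auprc_gain_top:.4f}, bottom={auprc_gain_bottom:.4f}")
```

In this toy ranking, AUROC rises by the same amount (1/9) whichever mistake is fixed, whereas average precision rises by roughly 0.167 for the top fix and only about 0.033 for the bottom fix. Under the stated assumptions, a model selected by AUPRC would therefore be steered toward corrections among the highest-scored samples.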