When determining which machine learning model performs best on a high-impact risk assessment task, practitioners commonly use the Area Under the Curve (AUC) to defend and validate their model choices. In this paper, we argue that the current use and understanding of AUC as a model performance metric departs from the way the metric was intended to be used. To this end, we characterize the misuse of AUC and illustrate how this misuse manifests harmfully in the real world across several risk assessment domains. We trace this disconnect to the way the original interpretation of AUC has shifted over time, to the point where issues pertaining to decision thresholds, class balance, statistical uncertainty, and protected groups remain unaddressed by AUC-based model comparisons, and where model choices that should be the purview of policymakers are hidden behind a veil of mathematical rigor. We conclude that current model validation practices involving AUC are not robust and are often invalid.