While research interest in out-of-distribution (OOD) detection methods is growing, there has been comparatively little discussion of how these methods should be evaluated. Given their relevance for safe(r) AI, it is important to examine whether the basis for comparing OOD detection methods is consistent with practical needs. In this work, we take a closer look at the go-to metrics for evaluating OOD detection, and question the approach of exclusively reducing OOD detection to a binary classification task with little consideration for the detection threshold. We illustrate the limitations of current metrics (AUROC and its friends) and propose a new metric, the Area Under the Threshold Curve (AUTC), which explicitly penalizes poor separation between in-distribution (ID) and OOD samples. Scripts and data are available at https://github.com/glhr/beyond-auroc
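To make the concern about the detection threshold concrete, the following minimal sketch compares two hypothetical detectors whose scores rank every OOD sample above every ID sample. AUROC depends only on that ranking, so it cannot tell the two apart, whereas an area computed by sweeping the threshold over the score range can. The score values, the function names, and the convention that scores lie in [0, 1] with higher meaning "more OOD" are assumptions made for illustration. The helper `threshold_sweep_area` is a stand-in in the spirit of a threshold-aware metric and may differ from the AUTC definition used in the paper and its repository.

```python
import numpy as np


def auroc(id_scores, ood_scores):
    """Probability that a random OOD sample scores higher than a random
    ID sample, counting ties as 0.5 (rank-based form of AUROC)."""
    id_s = np.asarray(id_scores, dtype=float)[:, None]
    ood_s = np.asarray(ood_scores, dtype=float)[None, :]
    return float(np.mean((ood_s > id_s) + 0.5 * (ood_s == id_s)))


def threshold_sweep_area(id_scores, ood_scores, num_thresholds=1001):
    """Hypothetical helper (not taken from the paper): mean of the areas
    under FPR(t) and FNR(t) for thresholds t in [0, 1].
    Assumes scores in [0, 1]; a sample is flagged as OOD when score >= t.
    Lower values mean ID and OOD scores are further apart."""
    id_s = np.asarray(id_scores, dtype=float)
    ood_s = np.asarray(ood_scores, dtype=float)
    t = np.linspace(0.0, 1.0, num_thresholds)
    fpr = np.mean(id_s[None, :] >= t[:, None], axis=1)  # ID flagged as OOD
    fnr = np.mean(ood_s[None, :] < t[:, None], axis=1)  # OOD missed

    def area(y):
        # Trapezoidal rule written out by hand to avoid NumPy version differences.
        return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(t)))

    return 0.5 * (area(fpr) + area(fnr))


# Made-up scores: both detectors rank all OOD samples above all ID samples.
well_separated = ([0.05, 0.10, 0.15], [0.85, 0.90, 0.95])
barely_separated = ([0.45, 0.47, 0.49], [0.51, 0.53, 0.55])

auroc_well = auroc(*well_separated)
auroc_barely = auroc(*barely_separated)
area_well = threshold_sweep_area(*well_separated)
area_barely = threshold_sweep_area(*barely_separated)

print("AUROC:", auroc_well, auroc_barely)
print("Threshold-sweep area:", area_well, area_barely)
```

Both detectors obtain the same AUROC, yet the second one leaves only a narrow band of thresholds that separates ID from OOD, which the threshold-sweep area reflects with a larger value.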