Out-of-distribution (OOD) detection methods are evaluated under the assumption that test ground truths are available, i.e., labels indicating whether each test sample is in-distribution (IND) or OOD. However, in the real world, such ground truths are not always available, so we do not know which samples are correctly detected and cannot compute metrics such as AUROC to compare the performance of different OOD detection methods. In this paper, we are the first to introduce the unsupervised evaluation problem in OOD detection, which aims to evaluate OOD detection methods in changing real-world environments without OOD labels. We propose three methods to compute Gscore, an unsupervised indicator of OOD detection performance. We further introduce a new benchmark, Gbench, which contains 200 real-world OOD datasets with various label spaces, to train and evaluate our method. Through experiments, we find a strong quantitative correlation between Gscore and OOD detection performance. Extensive experiments demonstrate that Gscore achieves state-of-the-art performance. Gscore also generalizes well across different IND/OOD datasets, OOD detection methods, backbones, and dataset sizes. We further provide analyses of the effects of backbones and IND/OOD datasets on OOD detection performance. The data and code will be made available.
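To make concrete why AUROC cannot be computed without test ground truths, the following is a minimal sketch of the standard pairwise definition of AUROC. The function name `auroc`, the convention that a higher score means "more likely OOD", and the toy values are illustrative assumptions rather than anything taken from the paper. The sketch does not implement Gscore, which the abstract does not define.

```python
def auroc(scores, is_ood):
    """Pairwise AUROC for OOD detection.

    Assumption (not from the paper): a higher score means "more likely OOD".
    Returns the fraction of (OOD, IND) pairs in which the OOD sample receives
    the higher score, with ties counted as one half.
    """
    ood_scores = [s for s, y in zip(scores, is_ood) if y]
    ind_scores = [s for s, y in zip(scores, is_ood) if not y]
    if not ood_scores or not ind_scores:
        # Without at least one labeled sample of each kind, the metric is undefined.
        raise ValueError("AUROC needs at least one IND and one OOD sample")

    wins = 0.0
    for o in ood_scores:
        for i in ind_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(ood_scores) * len(ind_scores))


# Toy data: detector scores and the ground-truth labels (1 = OOD, 0 = IND).
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
toy_auroc = auroc(toy_scores, toy_labels)  # 3 of 4 (OOD, IND) pairs are ranked correctly
```

The `is_ood` argument is exactly the ground truth that is missing in the unsupervised setting: the scores alone are not enough to evaluate this function, which is the gap an unsupervised indicator is meant to fill.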