Out-of-distribution (OOD) detection methods are typically evaluated under the assumption that test ground truths are available, i.e., labels indicating whether each test sample is in-distribution (IND) or OOD. However, in the real world, such ground truths are not always available, so we cannot know which samples are correctly detected and cannot compute metrics such as AUROC to evaluate the performance of different OOD detection methods. In this paper, we are the first to introduce the unsupervised evaluation problem in OOD detection, which aims to evaluate OOD detection methods in changing real-world environments without OOD labels. We propose three methods to compute Gscore, an unsupervised indicator of OOD detection performance. We further introduce a new benchmark, Gbench, which contains 200 real-world OOD datasets with various label spaces for training and evaluating our method. Through experiments, we find a strong quantitative correlation between Gscore and OOD detection performance. Extensive experiments demonstrate that Gscore achieves state-of-the-art performance. Gscore also generalizes well across different IND/OOD datasets, OOD detection methods, backbones, and dataset sizes. We further analyze the effects of backbones and IND/OOD datasets on OOD detection performance. The data and code will be made available.
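To make concrete why supervised evaluation depends on ground truths, the following is a minimal sketch of the AUROC computation for an OOD detector. It is an illustration only and not part of the proposed method: the function name `auroc`, the toy scores, and the convention that a higher score means "more likely OOD" are all assumptions introduced here.

```python
def auroc(scores, is_ood):
    """AUROC of OOD scores, computed as the probability that a randomly
    chosen OOD sample scores higher than a randomly chosen IND sample.
    Ties count as one half. Assumes higher score = more likely OOD."""
    ood = [s for s, y in zip(scores, is_ood) if y]
    ind = [s for s, y in zip(scores, is_ood) if not y]
    if not ood or not ind:
        raise ValueError("AUROC needs at least one IND and one OOD sample")
    wins = 0.0
    for o in ood:
        for i in ind:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(ood) * len(ind))


# Hypothetical toy data: four test samples with detector scores and
# the per-sample IND/OOD ground truths that AUROC requires.
scores = [0.1, 0.4, 0.35, 0.8]
is_ood = [False, False, True, True]
print(auroc(scores, is_ood))  # 0.75
```

The `is_ood` argument is exactly the per-sample ground truth that is unavailable in the unsupervised setting, which is why a label-free indicator is needed there.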