This paper presents a novel evaluation framework for Out-of-Distribution (OOD) detection that aims to assess the performance of machine learning models in more realistic settings. We observe that current testing protocols do not satisfy the real-world requirements for evaluating OOD detection methods: they usually encourage methods to have a strong bias towards a low level of diversity in normal data. To address this limitation, we propose new OOD test datasets (CIFAR-10-R, CIFAR-100-R, and ImageNet-30-R) that allow researchers to benchmark OOD detection performance under realistic distribution shifts. Additionally, we introduce a Generalizability Score (GS) to measure the generalization ability of a model during OOD detection. Our experiments demonstrate that improving performance on existing benchmark datasets does not necessarily improve the usability of OOD detection models in real-world scenarios. Although leveraging deep pre-trained features has been identified as a promising avenue for OOD detection research, our experiments show that state-of-the-art pre-trained models suffer a significant drop in performance when tested on our proposed datasets. To address this issue, we propose a post-processing stage that adapts pre-trained features under these distribution shifts before the OOD scores are calculated, which significantly enhances the performance of state-of-the-art pre-trained models on our benchmarks.