We present a comprehensive experimental study of pretrained feature extractors for visual out-of-distribution (OOD) detection. We examine several setups that differ in the availability of labels or image captions and in the combination of in- and out-distributions. Intriguingly, we find that (i) contrastive language-image pretrained models achieve state-of-the-art unsupervised OOD detection performance when nearest-neighbor feature similarity is used as the OOD detection score, (ii) state-of-the-art supervised OOD detection performance can be obtained without in-distribution fine-tuning, and (iii) even top-performing billion-scale vision transformers trained with natural language supervision fail to detect adversarially manipulated OOD images. Finally, based on our experiments, we discuss whether new benchmarks for visual anomaly detection are needed. Using the largest publicly available vision transformer, we achieve state-of-the-art performance across all $18$ reported OOD benchmarks, including an AUROC of 87.6\% (9.2\% gain, unsupervised) and 97.4\% (1.2\% gain, supervised) for the challenging task of CIFAR100 $\rightarrow$ CIFAR10 OOD detection. The code will be open-sourced.
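To make the scoring idea in finding (i) concrete, the following is a minimal sketch of nearest-neighbor feature similarity used as an OOD score, together with a pairwise AUROC computation. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the similarity measure or the number of neighbors, so cosine similarity and `k=1` are assumed here, and the function names (`knn_ood_scores`, `auroc`, `demo`) are hypothetical. Synthetic vectors stand in for features from a pretrained extractor.

```python
import math
import random


def l2_normalize(v):
    """Scale a feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def knn_ood_scores(train_feats, test_feats, k=1):
    """Score each test feature by the negative mean cosine similarity
    to its k nearest in-distribution training features.
    Higher score = more likely OOD. (Cosine similarity and k are assumptions.)"""
    train = [l2_normalize(f) for f in train_feats]
    scores = []
    for f in test_feats:
        q = l2_normalize(f)
        sims = sorted((sum(a * b for a, b in zip(q, t)) for t in train),
                      reverse=True)
        scores.append(-sum(sims[:k]) / k)
    return scores


def auroc(id_scores, ood_scores):
    """Probability that a random OOD sample scores higher than a random
    in-distribution sample; ties count as one half."""
    wins = 0.0
    for o in ood_scores:
        for i in id_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


def demo(seed=0, dim=8, n=50, noise=0.1):
    """Synthetic stand-in for extracted features: in-distribution samples
    cluster around axis 0, OOD samples around axis 1."""
    rng = random.Random(seed)

    def sample(axis):
        return [(1.0 if j == axis else 0.0) + rng.gauss(0.0, noise)
                for j in range(dim)]

    train_id = [sample(0) for _ in range(n)]
    test_id = [sample(0) for _ in range(n)]
    test_ood = [sample(1) for _ in range(n)]
    return auroc(knn_ood_scores(train_id, test_id),
                 knn_ood_scores(train_id, test_ood))


if __name__ == "__main__":
    print(f"AUROC on synthetic features: {demo():.3f}")
```

The score needs no labels and no fine-tuning, only a bank of in-distribution features, which is what makes it applicable in the unsupervised setting. With real data, the synthetic vectors would be replaced by embeddings from the chosen pretrained model, and an array library would typically replace the pure-Python loops.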