Removing out-of-distribution (OOD) images from noisy image collections scraped from the Internet is an important preprocessing step in dataset construction, and it can be addressed by zero-shot OOD detection with vision-language foundation models such as CLIP. The existing zero-shot OOD detection setting, however, does not consider the realistic case in which an image contains both in-distribution (ID) objects and OOD objects. Such images should be identified as ID images when collecting images of rare classes or of ethically inappropriate classes that must not be missed. In this paper, we propose a novel problem setting called in-distribution (ID) detection, in which images containing ID objects are identified as ID images, even if they also contain OOD objects, and images lacking ID objects are identified as OOD images. To solve this problem, we present a new approach, \textbf{G}lobal-\textbf{L}ocal \textbf{M}aximum \textbf{C}oncept \textbf{M}atching (GL-MCM), based on both global and local visual-text alignments of CLIP features, which can identify any image containing ID objects as an ID image. Extensive experiments demonstrate that GL-MCM outperforms comparison methods on both multi-object datasets and single-object ImageNet benchmarks.
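As an illustration only, one plausible reading of "global and local visual-text alignments" is a score that sums the maximum softmax-scaled image-text similarity of the global image feature with the maximum taken over all local (patch) features. The sketch below assumes this form; the function name `gl_mcm_score`, the `temperature` parameter, and the input arrays are hypothetical placeholders, and the features are assumed to come from a CLIP-style encoder rather than being computed here.

```python
import numpy as np


def _l2_normalize(x, axis=-1):
    # Scale vectors to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def _softmax(logits, axis=-1):
    # Numerically stable softmax over the class axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def gl_mcm_score(global_feat, local_feats, text_feats, temperature=1.0):
    """Assumed global + local maximum concept matching score.

    global_feat: (D,)   one feature for the whole image
    local_feats: (P, D) one feature per image patch
    text_feats:  (K, D) one feature per ID class prompt
    A higher score suggests the image contains an ID object.
    """
    g = _l2_normalize(np.asarray(global_feat, dtype=float))
    p = _l2_normalize(np.asarray(local_feats, dtype=float))
    t = _l2_normalize(np.asarray(text_feats, dtype=float))

    # Global term: best class probability for the whole-image feature.
    global_probs = _softmax(t @ g / temperature)            # (K,)
    # Local term: best class probability over every patch and class.
    local_probs = _softmax(p @ t.T / temperature, axis=-1)  # (P, K)

    return float(global_probs.max() + local_probs.max())
```

Under this assumed form, an image whose global feature is dominated by OOD content can still score highly if a single patch aligns strongly with one ID class prompt, which is the behaviour the ID detection setting asks for. Thresholding the score to make the final ID/OOD decision is left to the caller.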