Extracting in-distribution (ID) images from noisy images scraped from the Internet is an important preprocessing step for constructing datasets, and it has traditionally been done manually. Automating this step with deep learning techniques presents two key challenges. First, images should be collected using only the name of the ID class, without training on ID data. Second, as the motivation behind COCO illustrates, building robust recognizers requires treating as ID images not only those that contain ID objects alone but also those that contain both ID and out-of-distribution (OOD) objects. In this paper, we propose a novel problem setting called zero-shot ID detection, in which, without any training, images containing ID objects are identified as ID images (even if they also contain OOD objects) and images lacking ID objects are identified as OOD images. To solve this problem, we leverage the powerful zero-shot capability of CLIP and present a simple and effective approach, Global-Local Maximum Concept Matching (GL-MCM), based on both global and local visual-text alignments of CLIP features. Extensive experiments demonstrate that GL-MCM outperforms comparison methods on both multi-object datasets and single-object ImageNet benchmarks. The code will be available at https://github.com/AtsuMiyai/GL-MCM.
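The abstract does not spell out the scoring rule, so the following is only a minimal sketch of one plausible reading of "global and local visual-text alignments": a softmax-based concept-matching score computed once on the global image feature and once on per-patch local features, with the two maxima summed. The function names, the additive combination, the `temperature` default, and the thresholding helper are all assumptions made for illustration; the feature arrays stand in for CLIP image and text embeddings, which are not computed here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def gl_mcm_score(global_feat, local_feats, text_feats, temperature=1.0):
    """Hypothetical global-local concept-matching score.

    global_feat: (D,)   one feature for the whole image
    local_feats: (P, D) one feature per image patch
    text_feats:  (K, D) one feature per ID class name
    """
    g = l2_normalize(global_feat)
    l = l2_normalize(local_feats)
    t = l2_normalize(text_feats)

    # Global term: softmax over the K ID classes, keep the best match.
    global_probs = softmax(t @ g / temperature)            # (K,)
    # Local term: softmax over classes per patch, keep the best (patch, class).
    local_probs = softmax(l @ t.T / temperature, axis=-1)  # (P, K)

    # Assumed combination: sum of the two maxima.
    return float(global_probs.max() + local_probs.max())


def is_in_distribution(score, threshold):
    """Label an image as ID when its score reaches a user-chosen threshold."""
    return score >= threshold
```

Under these assumptions, the local term lets a single patch that matches an ID class raise the score even when the global feature is dominated by OOD content, which is the behaviour the problem setting asks for. The threshold would have to be chosen on held-out data.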