Zero-shot detection (ZSD), i.e., detection of classes not seen during training, is essential for real-world detection use cases but remains a difficult task. Recent research approaches ZSD with detection models that output embeddings instead of direct class labels. To this end, the output of the detection model must be aligned to a learned embedding space such as CLIP. However, this alignment is hindered by the cost of detection data sets, which are expensive to produce compared to image classification annotations, and by the resulting lack of category diversity in the training data. We address this challenge by leveraging the CLIP embedding space in combination with image labels from ImageNet. Our results show that image labels align the detector output to the embedding space more closely and thus have high potential for ZSD. Compared to training on detection data alone, adding image label data yields a significant gain of 3.3 mAP on the unseen classes for the 65/15 split on COCO, i.e., we more than double the gain of related work.
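To make the idea of an embedding-output detector concrete, the following minimal sketch shows how a predicted box embedding could be assigned a class by comparing it against per-class embeddings. It is an illustration under stated assumptions, not the method evaluated here: the function names, the use of cosine similarity as the scoring rule, and the 3-dimensional toy vectors (stand-ins for real CLIP text embeddings) are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_by_embedding(box_embedding, class_embeddings):
    """Return the class whose embedding is closest to the box embedding.

    Hypothetical helper: the scoring rule (cosine similarity + argmax)
    is an assumption made for illustration.
    """
    scores = {
        name: cosine_similarity(box_embedding, embedding)
        for name, embedding in class_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy class embeddings (made-up values, NOT real CLIP outputs).
# "airplane" plays the role of a class unseen during detector training:
# it can still be predicted because only its embedding is needed.
class_embeddings = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.1, 0.9, 0.0],
    "airplane": [0.0, 0.1, 0.9],
}

# Embedding the detector might output for one box (made-up values).
box_embedding = [0.1, 0.2, 0.8]

label, score = classify_by_embedding(box_embedding, class_embeddings)
```

Because classification reduces to a nearest-embedding lookup, adding a new category at test time only requires its embedding, which is why the quality of the alignment between detector output and embedding space matters for the unseen classes.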