Open-vocabulary detection (OVD) is an object detection task that aims to detect objects from novel categories beyond the base categories on which the detector is trained. Recent OVD methods rely on large-scale vision-language pre-trained models, such as CLIP, to recognize novel objects. We identify two core obstacles that must be tackled when incorporating these models into detector training: (1) the distribution mismatch that arises when a vision-language model trained on whole images is applied to region recognition tasks; and (2) the difficulty of localizing objects of unseen classes. To overcome these obstacles, we propose CORA, a DETR-style framework that adapts CLIP for Open-vocabulary detection by Region prompting and Anchor pre-matching. Region prompting mitigates the whole-to-region distribution gap by prompting the region features of the CLIP-based region classifier. Anchor pre-matching helps the detector learn generalizable object localization through a class-aware matching mechanism. We evaluate CORA on the COCO OVD benchmark, where it achieves 41.7 AP50 on novel classes, outperforming the previous state of the art by 2.4 AP50 without resorting to extra training data. When extra training data is available, we train CORA$^+$ on both ground-truth base-category annotations and additional pseudo bounding box labels computed by CORA. CORA$^+$ achieves 43.1 AP50 on the COCO OVD benchmark and 28.1 box APr on the LVIS OVD benchmark.
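The two components named in the abstract can be illustrated with a toy sketch. The abstract does not specify how the prompt is applied or how matching is computed, so everything below is an assumption made for illustration: the prompt is modeled as a vector added to a region feature, classification as cosine similarity against per-class text embeddings, and class-aware matching as a greedy IoU assignment restricted to anchor/ground-truth pairs that share a label (DETR-style detectors typically use Hungarian matching instead). All function names and data are hypothetical and are not taken from the CORA implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify_region(region_feat, prompt, text_embeds):
    """Assumed form of region prompting: add a prompt vector to the region
    feature, then pick the class whose text embedding is most similar."""
    prompted = [f + p for f, p in zip(region_feat, prompt)]
    scores = {name: cosine(prompted, emb) for name, emb in text_embeds.items()}
    return max(scores, key=scores.get)

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def class_aware_match(anchors, anchor_labels, gt_boxes, gt_labels):
    """Greedy one-to-one matching (a simplification) that only considers
    anchor/ground-truth pairs carrying the same class label.
    Returns a dict mapping ground-truth index -> anchor index."""
    pairs = sorted(
        (
            (iou(a, g), i, j)
            for i, a in enumerate(anchors)
            for j, g in enumerate(gt_boxes)
            if anchor_labels[i] == gt_labels[j]
        ),
        reverse=True,
    )
    used_anchors, used_gts, matches = set(), set(), {}
    for _, i, j in pairs:
        if i in used_anchors or j in used_gts:
            continue
        used_anchors.add(i)
        used_gts.add(j)
        matches[j] = i
    return matches

# Toy data: 2-D "embeddings" stand in for CLIP text and region features.
text_embeds = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
prompt = [0.0, 0.5]  # would be learned in practice; fixed here
anchors = [(0, 0, 10, 10), (20, 20, 40, 40), (25, 25, 45, 45)]
region_feats = [[0.9, 0.1], [0.2, 0.3], [0.1, 0.9]]

# Pre-matching step: label each anchor with the region classifier.
labels = [classify_region(f, prompt, text_embeds) for f in region_feats]

gt_boxes = [(1, 1, 11, 11), (21, 21, 41, 41)]
gt_labels = ["cat", "dog"]
matches = class_aware_match(anchors, labels, gt_boxes, gt_labels)
print(labels, matches)
```

In this sketch each anchor receives a class label before matching, so a ground-truth box can only be assigned to an anchor that the region classifier placed in the same category.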