This paper addresses the challenging problem of open-vocabulary object detection (OVOD), where an object detector must identify both seen and unseen classes in test images without labeled examples of the unseen classes during training. A typical approach to OVOD uses CLIP's joint text-image embeddings to assign each box proposal to its closest text label. However, this approach has a critical issue: because CLIP is not trained on exact object location information, many low-quality boxes, such as boxes that over- or under-cover the object, receive the same similarity score as high-quality boxes. To address this issue, we propose a novel method, LP-OVOD, that discards low-quality boxes by training a sigmoid linear classifier on pseudo labels retrieved from the region proposals most relevant to the novel text. Experimental results on COCO affirm the superior performance of our approach over the state of the art, achieving $\textbf{40.5}$ $\text{AP}_{novel}$ with a ResNet50 backbone, without using external datasets or knowing the novel classes during training. Our code will be available at https://github.com/VinAIResearch/LP-OVOD.
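The pseudo-labeling idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the two-dimensional feature vectors stand in for CLIP embeddings, the helper names (`retrieve_pseudo_labels`, `train_sigmoid_linear`, `score`), the value of `top_k`, the learning rate, and the choice to treat all non-retrieved proposals as negatives are all assumptions made for illustration only.

```python
import math

# Toy stand-ins for CLIP region embeddings (hypothetical values, not real features).
REGION_FEATS = [
    [0.95, 0.05],  # proposal 0: close to the novel text embedding
    [0.90, 0.10],  # proposal 1: close to the novel text embedding
    [0.10, 0.90],  # proposal 2: far from it
    [0.05, 0.95],  # proposal 3: far from it
    [0.20, 0.80],  # proposal 4: far from it
]
TEXT_FEAT = [1.0, 0.0]  # toy stand-in for the novel class's text embedding


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_pseudo_labels(region_feats, text_feat, top_k):
    """Return indices of the top_k proposals most similar to the text embedding."""
    sims = [cosine(f, text_feat) for f in region_feats]
    order = sorted(range(len(region_feats)), key=lambda i: sims[i], reverse=True)
    return order[:top_k]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    """Sigmoid linear classifier score for one feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_sigmoid_linear(feats, labels, epochs=200, lr=0.5):
    """Fit weights and bias with full-batch gradient descent on binary cross-entropy."""
    dim = len(feats[0])
    w, b = [0.0] * dim, 0.0
    n = len(feats)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(feats, labels):
            err = score(w, b, x) - y  # derivative of BCE w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Retrieved proposals become pseudo-positives; the rest are treated as
# negatives here (an assumption of this sketch, not stated in the abstract).
positives = retrieve_pseudo_labels(REGION_FEATS, TEXT_FEAT, top_k=2)
labels = [1.0 if i in positives else 0.0 for i in range(len(REGION_FEATS))]
weights, bias = train_sigmoid_linear(REGION_FEATS, labels)
scores = [score(weights, bias, x) for x in REGION_FEATS]
```

In this toy setting, the learned scores could then be thresholded to discard proposals, which mirrors the role the abstract assigns to the sigmoid linear classifier. How the actual method selects negatives and combines classifier scores with CLIP similarity is not specified in the abstract.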