Benefiting from large-scale vision-language pre-training on image-text pairs, open-world detection methods have shown superior generalization ability under zero-shot or few-shot detection settings. However, existing methods still require a pre-defined category space at inference time and predict only objects belonging to that space. To introduce a "real" open-world detector, we propose CapDet, a novel method that can either predict under a given category list or directly generate the category of each predicted bounding box. Specifically, we unify open-world detection and dense captioning into a single yet effective framework by introducing an additional dense captioning head that generates region-grounded captions. In turn, adding the captioning task improves the generalization of detection, since the captioning dataset covers more concepts. Experimental results show that, by unifying the dense captioning task, CapDet obtains significant performance improvements over the baseline method on LVIS (1203 classes), e.g., +2.1% mAP on LVIS rare classes. CapDet also achieves state-of-the-art performance on dense captioning, e.g., 15.44% mAP on VG V1.2 and 13.98% mAP on the VG-COCO dataset.
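The two inference modes described in the abstract (predicting under a given category list, or generating a label directly for each box) can be illustrated with a minimal sketch. This is not the CapDet implementation: the function names `cosine` and `label_regions`, the use of cosine similarity to match region features against category embeddings, and the toy `caption_fn` standing in for a dense captioning head are all assumptions made for illustration only.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_regions(
    region_feats: Sequence[Vector],
    category_embeds: Optional[Dict[str, Vector]] = None,
    caption_fn: Optional[Callable[[Vector], str]] = None,
) -> List[str]:
    """Hypothetical dual-mode labeling of detected regions.

    If a category list (name -> embedding) is given, each region takes the
    name of its most similar category. Otherwise each region is labeled by
    caption_fn, a stand-in for a captioning head (assumption).
    """
    labels: List[str] = []
    for feat in region_feats:
        if category_embeds:
            # Mode 1: closed category list, pick the best-matching name.
            best = max(
                category_embeds,
                key=lambda name: cosine(feat, category_embeds[name]),
            )
            labels.append(best)
        elif caption_fn is not None:
            # Mode 2: no category list, generate a label for the region.
            labels.append(caption_fn(feat))
        else:
            raise ValueError("Provide category_embeds or caption_fn.")
    return labels


def toy_caption(feat: Vector) -> str:
    """Toy caption generator: names the largest feature dimension."""
    idx = max(range(len(feat)), key=lambda i: feat[i])
    return f"region dominated by dimension {idx}"


regions = [[1.0, 0.0], [0.0, 1.0]]
categories = {"cat": [0.9, 0.1], "dog": [0.1, 0.9]}

with_list = label_regions(regions, category_embeds=categories)
without_list = label_regions(regions, caption_fn=toy_caption)
```

Here `with_list` holds one category name per region, while `without_list` holds free-form strings produced without any category list. In a real system the toy caption function would be replaced by a learned text decoder conditioned on region features.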