Benefiting from large-scale vision-language pre-training on image-text pairs, open-world detection methods have shown superior generalization ability under zero-shot and few-shot detection settings. However, existing methods still require a pre-defined category space at inference time and predict only objects belonging to that space. To introduce a "real" open-world detector, in this paper we propose CapDet, a novel method that can either predict under a given category list or directly generate the category of each predicted bounding box. Specifically, we unify open-world detection and dense captioning into a single, effective framework by introducing an additional dense captioning head that generates region-grounded captions. Adding the captioning task in turn benefits the generalization of detection, since the captioning dataset covers more concepts. Experimental results show that, by unifying the dense captioning task, CapDet obtains significant performance improvements over the baseline method on LVIS (1203 classes), e.g., +2.1% mAP on LVIS rare classes. CapDet also achieves state-of-the-art performance on dense captioning, e.g., 15.44% mAP on VG V1.2 and 13.98% mAP on the VG-COCO dataset.