Open-set object detection aims to detect arbitrary categories beyond those seen during training. Most recent work adopts the open-vocabulary paradigm, using vision-language backbones to represent categories with language. In this paper, we introduce DE-ViT, an open-set object detector that employs vision-only DINOv2 backbones and learns new categories through example images instead of language. To improve general detection ability, we transform multi-class classification tasks into binary classification tasks while bypassing per-class inference, and propose a novel region propagation technique for localization. We evaluate DE-ViT on open-vocabulary, few-shot, and one-shot object detection benchmarks with COCO and LVIS. On COCO, DE-ViT outperforms the open-vocabulary SoTA by 6.9 AP50 and achieves 50 AP50 on novel classes. DE-ViT surpasses the few-shot SoTA by 15 mAP on 10-shot and 7.2 mAP on 30-shot, and the one-shot SoTA by 2.8 AP50. On LVIS, DE-ViT outperforms the open-vocabulary SoTA by 2.2 mask AP and reaches 34.3 mask APr. Code is available at https://github.com/mlzxy/devit.
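To make the idea of learning categories from example images and scoring each class with a binary decision more concrete, the following is a minimal sketch and not the DE-ViT implementation. It assumes that each class is summarized by a prototype (here, the mean of its example feature vectors) and that a region is scored against every prototype with an independent sigmoid rather than a softmax over classes. The function names `build_prototypes` and `binary_scores`, the `temperature` value, and the toy 2-D features standing in for DINOv2 features are all hypothetical. The sketch does not attempt to reproduce how the paper bypasses per-class inference, nor its region propagation step.

```python
import math

def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def build_prototypes(examples):
    """Map {class_name: [feature_vector, ...]} to {class_name: unit prototype}.

    Assumption: a class prototype is the mean of its example features.
    """
    protos = {}
    for name, feats in examples.items():
        mean = [sum(col) / len(feats) for col in zip(*feats)]
        protos[name] = normalize(mean)
    return protos

def binary_scores(region, protos, temperature=0.1):
    """Return an independent sigmoid score per class for one region feature.

    Each score answers "is this region class c or not?", so the scores
    are not forced to sum to 1 as they would be with a softmax.
    """
    r = normalize(region)
    scores = {}
    for name, p in protos.items():
        sim = sum(a * b for a, b in zip(r, p))  # cosine similarity
        scores[name] = 1.0 / (1.0 + math.exp(-sim / temperature))
    return scores

# Toy example features (hypothetical stand-ins for backbone features).
examples = {
    "cat": [[1.0, 0.0], [0.9, 0.1]],
    "dog": [[0.0, 1.0], [0.1, 0.9]],
}
protos = build_prototypes(examples)
scores = binary_scores([1.0, 0.05], protos)
best = max(scores, key=scores.get)
```

Under these assumptions, adding a new category only requires adding its example features to `examples`; nothing is retrained and no text embedding is involved.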