Open-set object detection aims to detect arbitrary categories beyond those seen during training. Most recent advances adopt the open-vocabulary paradigm, using vision-language backbones to represent categories with language. In this paper, we introduce DE-ViT, an open-set object detector that employs vision-only DINOv2 backbones and learns new categories through example images instead of language. To improve general detection ability, we transform multi-class classification tasks into binary classification tasks while bypassing per-class inference, and propose a novel region propagation technique for localization. We evaluate DE-ViT on open-vocabulary, few-shot, and one-shot object detection benchmarks with COCO and LVIS. On COCO, DE-ViT outperforms the open-vocabulary SoTA by 6.9 AP50 and achieves 50 AP50 on novel classes. It also surpasses the few-shot SoTA by 15 mAP on 10-shot and 7.2 mAP on 30-shot, and the one-shot SoTA by 2.8 AP50. On LVIS, DE-ViT outperforms the open-vocabulary SoTA by 2.2 mask AP and reaches 34.3 mask APr. Code is available at https://github.com/mlzxy/devit.