Few-shot object detection aims to detect novel categories given only a few example images. Recent methods focus on finetuning strategies, whose complicated procedures hinder wider application. In this paper, we introduce DE-ViT, a few-shot object detector that requires no finetuning. DE-ViT's architecture is built on a new region-propagation mechanism for localization, and the propagated region masks are transformed into bounding boxes through a learnable spatial integral layer. Instead of training prototype classifiers, we propose to use prototypes to project ViT features into a subspace that is robust to overfitting on base classes. We evaluate DE-ViT on few-shot and one-shot object detection benchmarks with Pascal VOC, COCO, and LVIS, and it establishes new state-of-the-art results on all of them. Notably, on COCO, DE-ViT surpasses the few-shot SoTA by 15 mAP on 10-shot and by 7.2 mAP on 30-shot, and the one-shot SoTA by 2.8 AP50. On LVIS, DE-ViT outperforms the few-shot SoTA by 20 box APr.
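The idea of turning a region mask into a box via a spatial integral can be illustrated with a toy, non-learned analogue: if the mask is treated as a spatial probability map, a box can be read off from the expected coordinates and the spread of its mass. This is only an illustrative sketch under that assumption, not the paper's learnable layer; the function name `mask_to_box` and the margin factor `k` are hypothetical.

```python
def mask_to_box(mask, k=2.0):
    """Toy spatial-integral sketch: convert a soft 2D mask (list of rows of
    non-negative scores) into an (x1, y1, x2, y2) box by integrating over
    pixel coordinates. The real DE-ViT layer is learned; this only shows
    why an integral over a mask yields a differentiable box estimate."""
    h, w = len(mask), len(mask[0])
    total = sum(sum(row) for row in mask)
    # Expected x and y under the mask, treated as an (unnormalized) density.
    ex = sum(x * mask[y][x] for y in range(h) for x in range(w)) / total
    ey = sum(y * mask[y][x] for y in range(h) for x in range(w)) / total
    # Variance of the mass along each axis gives the extent of the region.
    vx = sum((x - ex) ** 2 * mask[y][x] for y in range(h) for x in range(w)) / total
    vy = sum((y - ey) ** 2 * mask[y][x] for y in range(h) for x in range(w)) / total
    sx, sy = vx ** 0.5, vy ** 0.5
    # Box edges placed k standard deviations from the centroid.
    return (ex - k * sx, ey - k * sy, ex + k * sx, ey + k * sy)

# Example: a uniform mask over the cells x in [2, 5], y in [3, 6] of a 10x10 grid.
mask = [[1.0 if 2 <= x <= 5 and 3 <= y <= 6 else 0.0 for x in range(10)]
        for y in range(10)]
box = mask_to_box(mask)
```

Because every step is a weighted sum over pixel coordinates, gradients flow from the box back into the mask scores, which is what makes an integral-style mask-to-box mapping trainable end to end.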