Few-shot object detection (FSOD) benchmarks have advanced techniques for detecting new categories with limited annotations. Existing benchmarks repurpose well-established datasets like COCO by partitioning categories into base and novel classes for pre-training and fine-tuning, respectively. However, these benchmarks do not reflect how FSOD is deployed in practice. Rather than only pre-training on a small number of base categories, we argue that it is more practical to fine-tune a foundation model (e.g., a vision-language model (VLM) pre-trained on web-scale data) for a target domain. Surprisingly, we find that zero-shot inference from VLMs like GroundingDINO significantly outperforms the state of the art on COCO (48.3 vs. 33.1 AP). However, such zero-shot models can still be misaligned with target concepts of interest. For example, trailers on the web may be different from trailers in the context of autonomous vehicles. In this work, we propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external datasets and fine-tuned on K shots per target class. Further, we note that current FSOD benchmarks are actually federated datasets, in which each category is exhaustively annotated on only a subset of the data. We leverage this insight to propose simple strategies for fine-tuning VLMs with federated losses. We demonstrate the effectiveness of our approach on LVIS and nuImages, improving over prior work by 5.9 AP.
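To make the idea of a federated loss concrete, the following is a minimal sketch under stated assumptions, not the method proposed in this work. It assumes one common reading of a federated dataset: for each image, only some categories are known to be exhaustively labeled (as present or absent), and the classification loss is computed over those categories only, so unlabeled categories are never treated as negatives. The function name `federated_bce` and the `annotated` mask are hypothetical names introduced for illustration.

```python
import math


def federated_bce(logits, targets, annotated):
    """Mean binary cross-entropy over the categories annotated for this image.

    logits:    per-category raw scores for one predicted box
    targets:   per-category 0/1 labels (only meaningful where annotated)
    annotated: per-category booleans (hypothetical mask); True if the category
               was exhaustively labeled, as present or absent, in this image
    """
    total, count = 0.0, 0
    for z, y, keep in zip(logits, targets, annotated):
        if not keep:
            continue  # unknown status: do not penalize as a negative
        # Numerically stable BCE with logits:
        # max(z, 0) - z * y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        count += 1
    # Assumption: an image with no annotated categories contributes zero loss.
    return total / count if count else 0.0


# Three categories; the third was never checked by annotators in this image.
logits = [2.0, -1.0, 3.0]
targets = [1, 0, 0]
masked = federated_bce(logits, targets, [True, True, False])
naive = federated_bce(logits, targets, [True, True, True])
print(f"federated: {masked:.4f}  all-categories: {naive:.4f}")
```

In this toy case the confident prediction for the unchecked third category raises the naive loss, while the masked version ignores it. That is the behavior one would want if, for example, an unlabeled trailer is actually present in the image.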