Revisiting Few-Shot Object Detection with Vision-Language Models

Few-shot object detection (FSOD) benchmarks have advanced techniques for detecting new categories with limited annotations. Existing benchmarks repurpose well-established datasets like COCO by partitioning categories into base and novel classes for pre-training and fine-tuning, respectively. However, these benchmarks do not reflect how FSOD is deployed in practice. Rather than only pre-training on a small number of base categories, we argue that it is more practical to fine-tune a foundation model (e.g., a vision-language model (VLM) pre-trained on web-scale data) for a target domain. Surprisingly, we find that zero-shot inference from VLMs like GroundingDINO significantly outperforms the state of the art (48.3 vs. 33.1 AP) on COCO. However, such zero-shot models can still be misaligned to target concepts of interest. For example, trailers on the web may be different from trailers in the context of autonomous vehicles. In this work, we propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external datasets and fine-tuned on K shots per target class. Further, we note that current FSOD benchmarks are actually federated datasets containing exhaustive annotations for each category on a subset of the data. We leverage this insight to propose simple strategies for fine-tuning VLMs with federated losses. We demonstrate the effectiveness of our approach on LVIS and nuImages, improving over prior work by 5.9 AP.
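
The idea behind a federated loss can be illustrated with a minimal sketch. This is not the authors' implementation: it only assumes that each image comes with a set of categories that were exhaustively checked (present or verified absent), and that the classification loss should be computed over those categories alone, so unannotated categories are never penalized as false positives. The function name `federated_bce` and the parameter `verified_classes` are hypothetical names chosen for illustration.

```python
import math


def bce_with_logit(logit, label):
    """Numerically stable binary cross-entropy for one logit and a 0/1 label."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def federated_bce(logits, target_class, verified_classes):
    """Average per-class BCE over the classes verified for this image only.

    logits           -- list of per-class scores for one box proposal
    target_class     -- index of the matched class, or None for background
    verified_classes -- indices of classes exhaustively annotated in this
                        image (an assumption of this sketch); every other
                        class is skipped and contributes no loss
    """
    classes = sorted(verified_classes)
    if not classes:
        return 0.0
    total = 0.0
    for c in classes:
        label = 1.0 if c == target_class else 0.0
        total += bce_with_logit(logits[c], label)
    return total / len(classes)


# Class 2 is not verified for this image, so its high score is not penalized.
logits = [2.0, -1.0, 3.0]
loss_federated = federated_bce(logits, target_class=0, verified_classes={0, 1})
loss_all_classes = federated_bce(logits, target_class=0, verified_classes={0, 1, 2})
print(loss_federated, loss_all_classes)
```

In this toy example the loss over all classes is larger, because the confident score for class 2 is treated as a mistake even though nothing is known about whether class 2 appears in the image. Restricting the loss to verified classes avoids that penalty.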
