We present Grounding DINO, an open-set object detector that marries the Transformer-based detector DINO with grounded pre-training and can detect arbitrary objects from human inputs such as category names or referring expressions. The key to open-set object detection is introducing language into a closed-set detector so that it generalizes to open-set concepts. To fuse the language and vision modalities effectively, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution comprising a feature enhancer, language-guided query selection, and a cross-modality decoder. While previous works mainly evaluate open-set object detection on novel categories, we propose to also evaluate on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. It achieves $52.5$ AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO, and sets a new record on the ODinW zero-shot benchmark with a mean $26.1$ AP. Code will be available at \url{https://github.com/IDEA-Research/GroundingDINO}.