Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks, including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC). Its effectiveness has led to its wide adoption as a mainstream architecture for downstream applications. However, because its training code has not been released, the original Grounding-DINO model lacks comprehensive public technical details. To bridge this gap, we present MM-Grounding-DINO, an open-source, comprehensive, and user-friendly baseline built with the MMDetection toolbox. It uses a wide range of vision datasets for pre-training and a variety of detection and grounding datasets for fine-tuning. We provide a comprehensive analysis of each reported result, along with detailed settings for reproduction. Extensive experiments on the benchmarks we evaluate demonstrate that MM-Grounding-DINO-Tiny outperforms the Grounding-DINO-Tiny baseline. Code and trained models are released to the research community at https://github.com/open-mmlab/mmdetection/tree/main/configs/mm_grounding_dino.