OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Open-vocabulary detection is challenging because it requires detecting objects from class names, including classes not encountered during training. Existing methods have shown strong zero-shot detection capabilities through pre-training on diverse large-scale datasets. However, these approaches still face two primary challenges: (i) how to universally integrate diverse data sources for end-to-end training, and (ii) how to effectively leverage language-aware capability for region-level cross-modality understanding. To address these challenges, we propose OV-DINO, a unified open-vocabulary detection method that pre-trains on diverse large-scale datasets with language-aware selective fusion in a unified framework. Specifically, we introduce a Unified Data Integration (UniDI) pipeline that unifies different data sources into detection-centric data, which enables end-to-end training and eliminates the noise introduced by pseudo-label generation. In addition, we propose a Language-Aware Selective Fusion (LASF) module that makes the model language-aware through a language-aware query selection and fusion process. We evaluate OV-DINO on popular open-vocabulary detection benchmarks, achieving state-of-the-art zero-shot results of 50.6\% AP on COCO and 40.0\% AP on LVIS, which demonstrates its strong generalization ability. Furthermore, OV-DINO fine-tuned on COCO achieves 58.4\% AP, outperforming many existing methods with the same backbone. The code for OV-DINO will be available at \href{https://github.com/wanghao9610/OV-DINO}{https://github.com/wanghao9610/OV-DINO}.

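As a rough illustration of what "unifying different data sources into detection-centric data" could look like, the Python sketch below maps two kinds of samples into one record format. It is only a sketch under stated assumptions: the abstract does not specify the conversion, so treating an image-text pair as a single whole-image box labeled with its caption is an assumption made here, and the names `DetectionSample`, `from_detection`, and `from_image_text` are hypothetical and not taken from the released code.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class DetectionSample:
    """One image in a detection-centric format: parallel lists of boxes and texts."""
    image_id: str
    width: int
    height: int
    boxes: List[Box]
    texts: List[str]


def from_detection(image_id, width, height, annotations):
    """Detection data already has boxes; keep them and use the class name as text."""
    boxes = [tuple(float(v) for v in box) for box, _ in annotations]
    texts = [name for _, name in annotations]
    return DetectionSample(image_id, width, height, boxes, texts)


def from_image_text(image_id, width, height, caption):
    """ASSUMPTION: an image-text pair becomes one 'object' whose box covers the
    whole image and whose label is the caption, so no pseudo-labels are generated."""
    full_image_box = (0.0, 0.0, float(width), float(height))
    return DetectionSample(image_id, width, height, [full_image_box], [caption])


# Both sources end up in the same record type and can share one training loop.
det = from_detection("img_001", 640, 480, [((10, 20, 200, 300), "dog")])
cap = from_image_text("img_002", 800, 600, "a dog running on the beach")
print(det.boxes, det.texts)
print(cap.boxes, cap.texts)
```

Because every sample carries boxes paired with free-form text, a single detection loss could in principle be applied to both sources without a separate pseudo-labeling stage, which is the property the abstract attributes to UniDI.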