Advancing object detection (OD) in open-vocabulary and open-world scenarios is a critical challenge in computer vision. This work introduces OmDet, a novel language-aware object detection architecture, together with a training mechanism that combines continual learning with multi-dataset vision-language pre-training. Using natural language as a universal knowledge representation, OmDet accumulates a "visual vocabulary" from diverse datasets and casts detection as a single language-conditioned task. Our multimodal detection network (MDN) overcomes the challenges of multi-dataset joint training and generalizes across numerous training datasets without manual merging of label taxonomies. OmDet outperforms strong baselines in object detection in the wild, open-vocabulary detection, and phrase grounding, achieving state-of-the-art results. Ablation studies reveal the impact of scaling the pre-training visual vocabulary and indicate that expansion to larger datasets is a promising direction. The deep fusion approach learns jointly from multiple datasets, and the resulting knowledge sharing improves performance.
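To make the idea of language-conditioned detection without taxonomy merging more concrete, the following is a minimal data-handling sketch, not the OmDet implementation. It assumes that each dataset keeps its own list of category names, that those names are passed to the detector as text prompts, and that each ground-truth box is matched to a prompt by index. All names here (`Annotation`, `ConditionedSample`, `build_sample`, `visual_vocabulary`) are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class Annotation:
    box: Box
    label: str  # category name in the source dataset's own vocabulary


@dataclass
class ConditionedSample:
    image_id: str
    prompts: List[str]               # category names given to the detector as text
    targets: List[Tuple[Box, int]]   # each box paired with an index into `prompts`


def build_sample(image_id: str,
                 annotations: List[Annotation],
                 dataset_vocab: List[str]) -> ConditionedSample:
    """Condition one image on its own dataset's label names (assumed scheme).

    No global label-ID table is built: a target refers to a position in the
    prompt list, so datasets with different taxonomies can share one model.
    """
    prompts = list(dataset_vocab)
    index = {name: i for i, name in enumerate(prompts)}
    targets = [(a.box, index[a.label]) for a in annotations]
    return ConditionedSample(image_id, prompts, targets)


def visual_vocabulary(vocabs: Dict[str, List[str]]) -> List[str]:
    """Union of all category names seen so far, case-insensitive, in first-seen order."""
    seen: Dict[str, None] = {}
    for names in vocabs.values():
        for name in names:
            seen.setdefault(name.lower(), None)
    return list(seen)


# Two hypothetical datasets with overlapping but different label sets.
vocabs = {"dataset_a": ["person", "dog"], "dataset_b": ["Dog", "traffic cone"]}
sample = build_sample(
    "img_001",
    [Annotation((0.0, 0.0, 1.0, 1.0), "traffic cone")],
    vocabs["dataset_b"],
)
vocab = visual_vocabulary(vocabs)
```

In this sketch the accumulated vocabulary grows simply by adding another dataset's name list, which is one plausible reading of how a shared text representation removes the need to reconcile label IDs by hand.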