Deriving reliable region-word alignment from image-text pairs is critical for learning object-level vision-language representations for open-vocabulary object detection. Existing methods typically rely on pre-trained or self-trained vision-language models for alignment, which are prone to limitations in localization accuracy or generalization capability. In this paper, we propose CoDet, a novel approach that overcomes the reliance on a pre-aligned vision-language space by reformulating region-word alignment as a co-occurring object discovery problem. Intuitively, when images whose captions mention a shared concept are grouped together, objects corresponding to that concept should exhibit high co-occurrence within the group. CoDet then leverages visual similarities to discover the co-occurring objects and align them with the shared concept. Extensive experiments demonstrate that CoDet achieves superior performance and compelling scalability in open-vocabulary detection; e.g., by scaling up the visual backbone, CoDet achieves 37.0 $\text{AP}^m_{novel}$ and 44.7 $\text{AP}^m_{all}$ on OV-LVIS, surpassing the previous SoTA by 4.2 $\text{AP}^m_{novel}$ and 9.8 $\text{AP}^m_{all}$. Code is available at https://github.com/CVMI-Lab/CoDet.
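The co-occurrence intuition can be illustrated with a toy sketch. It is not CoDet's implementation. It assumes that region features are already available as plain vectors and that a caption "mentions" a concept when the word appears in it. It scores each region by its mean best-match cosine similarity to the other images in the concept group, a simple heuristic chosen here for illustration. The names `group_by_concept`, `co_occurrence_scores`, and `align_concept`, and the example data, are hypothetical.

```python
import math
from collections import defaultdict


def group_by_concept(captions, vocabulary):
    """Map each concept word to the ids of images whose caption mentions it."""
    groups = defaultdict(list)
    for image_id, caption in captions.items():
        words = set(caption.lower().split())
        for concept in vocabulary:
            if concept in words:
                groups[concept].append(image_id)
    return dict(groups)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def co_occurrence_scores(query_regions, support_images):
    """Score each query region by its mean best-match similarity to each support image."""
    scores = []
    for q in query_regions:
        best = [max(cosine(q, r) for r in regions) for regions in support_images]
        scores.append(sum(best) / len(best))
    return scores


def align_concept(concept, groups, region_features):
    """Return {image_id: index of the region pseudo-aligned with `concept`}."""
    image_ids = groups[concept]
    alignment = {}
    for image_id in image_ids:
        support = [region_features[o] for o in image_ids if o != image_id]
        if not support:  # a single-image group carries no co-occurrence signal
            continue
        scores = co_occurrence_scores(region_features[image_id], support)
        alignment[image_id] = max(range(len(scores)), key=scores.__getitem__)
    return alignment


# Hypothetical toy data: feature axes loosely stand for (dog, car, grass, road).
captions = {
    "a": "a dog on the grass",
    "b": "a dog near a car",
    "c": "a car on the road",
}
region_features = {
    "a": [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
    "b": [[0.0, 1.0, 0.0, 0.0], [0.9, 0.0, 0.1, 0.0]],
    "c": [[0.0, 0.95, 0.0, 0.05], [0.0, 0.0, 0.0, 1.0]],
}

groups = group_by_concept(captions, ["dog", "car"])
dog_alignment = align_concept("dog", groups, region_features)
car_alignment = align_concept("car", groups, region_features)
print(groups)
print(dog_alignment, car_alignment)
```

In this toy setup, the region selected for each image is the one that recurs across the group, and image-specific background regions score low. Note that nothing here relies on a text encoder: the concept word is only used to form the group, which mirrors the abstract's point about avoiding a pre-aligned vision-language space.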