Pre-trained vision-language models (VLMs) learn to align vision and language representations on large-scale datasets, where each image-text pair usually contains a bag of semantic concepts. However, existing open-vocabulary object detectors align each region embedding only individually with the corresponding features extracted from the VLMs. Such a design leaves the compositional structure of semantic concepts in a scene under-exploited, although the VLMs may learn this structure implicitly. In this work, we propose to align the embedding of a bag of regions, going beyond individual regions. The proposed method groups contextually interrelated regions into a bag. The embeddings of the regions in a bag are treated as the embeddings of words in a sentence and are sent to the text encoder of a VLM to obtain the bag-of-regions embedding, which is trained to align with the corresponding features extracted by a frozen VLM. Applied to the commonly used Faster R-CNN, our approach surpasses the previous best results by 4.6 box AP50 and 2.8 mask AP on novel categories of the open-vocabulary COCO and LVIS benchmarks, respectively. Code and models are available at https://github.com/wusize/ovdet.
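The bag-of-regions alignment described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the released implementation: `toy_text_encoder` is a hypothetical stand-in (mean pooling) for the VLM text encoder, the teacher features are simulated rather than extracted by a frozen VLM, and the symmetric InfoNCE loss and the temperature of 0.07 are assumptions, since the abstract does not name the objective.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (guarding against a zero norm)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def toy_text_encoder(pseudo_words):
    """Hypothetical stand-in for a VLM text encoder.

    The region embeddings of one bag are treated like word embeddings of a
    sentence; here they are simply mean-pooled into one bag embedding.
    """
    dim = len(pseudo_words[0])
    pooled = [sum(w[i] for w in pseudo_words) / len(pseudo_words) for i in range(dim)]
    return l2_normalize(pooled)


def bag_alignment_loss(student_bags, teacher_bags, temperature=0.07):
    """Symmetric InfoNCE (an assumed objective): bag i should match teacher i."""
    logits = [[dot(s, t) / temperature for t in teacher_bags] for s in student_bags]

    def mean_cross_entropy(matrix):
        total = 0.0
        for i, row in enumerate(matrix):
            peak = max(row)  # subtract the max for numerical stability
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[i]
        return total / len(matrix)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (mean_cross_entropy(logits) + mean_cross_entropy(transposed))


random.seed(0)
DIM, NUM_BAGS, REGIONS_PER_BAG = 8, 3, 4

# Simulated region embeddings, grouped into bags of neighboring regions.
bags = [
    [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(REGIONS_PER_BAG)]
    for _ in range(NUM_BAGS)
]
student = [toy_text_encoder(bag) for bag in bags]

# Simulated teacher features: a perfectly aligned teacher vs. a rotated one.
matched = bag_alignment_loss(student, student)
mismatched = bag_alignment_loss(student, student[1:] + student[:1])
print(f"matched loss: {matched:.4f}, mismatched loss: {mismatched:.4f}")
```

In this toy setting the loss is lower when each bag embedding is paired with its own teacher feature than when the pairing is rotated, which is the behavior the alignment objective is meant to reward.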