In recent years, open-vocabulary (OV) object detection has attracted increasing research attention. Unlike traditional detection, which recognizes only objects from a fixed set of categories, OV detection aims to detect objects from an open category set. Previous works often leverage vision-language (VL) training data (e.g., referring grounding data) to recognize OV objects. However, they use only the pairs of nouns and individual objects in VL data, although these data usually contain much richer information, such as scene graphs, that is also crucial for OV detection. In this paper, we propose a novel Scene-Graph-Based Discovery Network (SGDN) that exploits scene graph cues for OV detection. First, we present a scene-graph-based decoder (SGDecoder) that includes sparse scene-graph-guided attention (SSGA); it captures scene graphs and leverages them to discover OV objects. Second, we propose scene-graph-based prediction (SGPred), in which a scene-graph-based offset regression (SGOR) mechanism enables mutual enhancement between scene graph extraction and object localization. Third, we design a cross-modal learning mechanism in SGPred that uses scene graphs as bridges to improve the consistency between cross-modal embeddings for OV object classification. Experiments on COCO and LVIS demonstrate the effectiveness of our approach. Moreover, we show that our model can perform OV scene graph detection, a task that previous OV scene graph generation methods cannot tackle.