Open-vocabulary object detection (OVD) requires solid modeling of the region-semantic relationship, which can be learned from massive region-text pairs. However, such data is limited in practice because annotation is costly. In this work, we propose RTGen, which generates scalable open-vocabulary region-text pairs, and demonstrate that it boosts the performance of open-vocabulary object detection. RTGen includes both text-to-region and region-to-text generation processes on scalable image-caption data. Text-to-region generation is powered by image inpainting, directed by our proposed scene-aware inpainting guider for overall layout harmony. For region-to-text generation, we perform region-level image captioning multiple times with various prompts and select the best-matching text according to CLIP similarity. To facilitate detection training on region-text pairs, we also introduce a localization-aware region-text contrastive loss that learns object proposals tailored to different localization qualities. Extensive experiments demonstrate that RTGen serves as a scalable, semantically rich, and effective data source for open-vocabulary object detection, continues to improve model performance as more data is utilized, and outperforms existing state-of-the-art methods.
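The caption-selection step in region-to-text generation can be illustrated with a minimal sketch. It assumes that the region crop and each candidate caption have already been embedded (for example by CLIP's image and text encoders) and that "CLIP similarity" means cosine similarity between those embeddings. The function names `cosine_similarity` and `select_best_caption`, and the toy vectors, are hypothetical and not taken from RTGen itself.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_best_caption(
    region_embedding: Sequence[float],
    captions: List[str],
    caption_embeddings: List[Sequence[float]],
) -> Tuple[str, float]:
    """Return the candidate caption most similar to the region, with its score.

    Assumption: the embeddings come from a shared image-text space
    (such as CLIP); this sketch does not compute them.
    """
    if not captions or len(captions) != len(caption_embeddings):
        raise ValueError("captions and caption_embeddings must be non-empty and aligned")
    scores = [cosine_similarity(region_embedding, e) for e in caption_embeddings]
    best = max(range(len(captions)), key=scores.__getitem__)
    return captions[best], scores[best]


# Toy example: three captions produced with different prompts for one region.
region = [1.0, 0.0, 0.0]
candidates = ["a dog", "a brown dog on grass", "a car"]
embeddings = [[0.6, 0.8, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
best_caption, best_score = select_best_caption(region, candidates, embeddings)
```

In this toy case the second caption is selected because its embedding points most nearly in the same direction as the region embedding. In a real pipeline the vectors would be produced by the encoders rather than written by hand.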