Vision-language pretraining that learns a fine-grained region-word alignment between image-caption pairs has propelled progress in open-vocabulary object detection. We observe that region-word alignment methods are typically applied to detection using only object nouns, and that the impact of other rich context in captions, such as attributes, is unclear. In this study, we explore how language context affects downstream object detection and propose to enhance the role of context. In particular, we show how to strategically contextualize the grounding pretraining objective for improved alignment. We further home in on attributes as especially useful object context and propose a novel adjective- and noun-based negative sampling strategy that increases their focus in contrastive learning. Overall, our methods improve object detection over the state of the art in region-word pretraining. We also highlight the fine-grained utility of an attribute-sensitive model through text-region retrieval and phrase grounding analysis.
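To make the idea of adjective- and noun-based negative sampling concrete, the following is a minimal sketch and not the procedure used in this work, whose details the abstract does not give. It assumes each positive phrase is an adjective-noun pair and builds negatives by swapping exactly one of the two words. The vocabularies `ADJECTIVES` and `NOUNS` and the function `sample_negatives` are hypothetical names introduced only for illustration.

```python
import random

# Hypothetical vocabularies (assumption): the abstract does not say
# where negative words are drawn from.
ADJECTIVES = ["red", "blue", "wooden", "small", "striped"]
NOUNS = ["car", "dog", "table", "umbrella", "bench"]


def sample_negatives(adjective, noun, k=4, seed=None):
    """Return up to k negative phrases that differ from "adjective noun"
    in exactly one word: either the adjective or the noun is swapped."""
    rng = random.Random(seed)
    # Adjective negatives keep the noun, so only the attribute differs.
    adj_negs = [f"{a} {noun}" for a in ADJECTIVES if a != adjective]
    # Noun negatives keep the adjective, so only the object category differs.
    noun_negs = [f"{adjective} {n}" for n in NOUNS if n != noun]
    candidates = adj_negs + noun_negs
    rng.shuffle(candidates)
    return candidates[:k]


positive = "red car"
negatives = sample_negatives("red", "car", k=4, seed=0)
```

In a contrastive setup, phrases such as these would be scored against the same image region as the positive phrase, so the model is penalized when it cannot tell "red car" from "blue car" or from "red bench". How negatives are actually weighted or mixed in the loss is left open here.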