Vision-language alignment learned from image-caption pairs has been shown to benefit tasks like object recognition and detection. Methods are mostly evaluated in terms of how well object class names are learned, but captions also contain rich attribute context that should be considered when learning object alignment. It is unclear how methods use this context during learning, or whether models succeed when tasks require understanding both attributes and objects. To address this gap, we conduct an extensive analysis of the role of attributes in vision-language models. We specifically measure model sensitivity to the presence and meaning of attribute context, gauging its influence on object embeddings through unsupervised phrase grounding and classification-via-description methods. We further evaluate the utility of attribute context in training for open-vocabulary object detection, fine-grained text-region retrieval, and attribution tasks. Our results show that attribute context can be wasted when learning alignment for detection, that attribute meaning is not adequately considered in embeddings, and that describing classes by only their attributes is ineffective. One viable strategy we find for increasing the benefit from attributes is contrastive training with adjective-based negative captions.