Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information that may be arbitrarily probed with natural language phrases. Understanding an image, however, is not just about understanding what content resides within it but, importantly, where that content resides. In this work we examine how well vision-language models understand where objects reside within an image and how well they group together visually related parts of that image. We demonstrate that contemporary vision and language representation learning models based on contrastive losses and large web-based data capture only limited object localization information. We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information. We measure performance in terms of zero-shot image recognition, unsupervised bottom-up and top-down semantic segmentation, and robustness analyses. We find that the resulting model achieves state-of-the-art results in unsupervised segmentation, and we demonstrate that the learned representations are uniquely robust to spurious correlations in datasets designed to probe the causal behavior of vision models.