Vision-language foundation models have shown remarkable performance in various zero-shot settings such as image retrieval, classification, and captioning. So far, however, these models seem to fall behind in zero-shot localization of referential expressions and objects in images, and as a result they need to be fine-tuned for this task. In this paper, we show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning. To leverage these capabilities, we propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path. We show that self-self attention corresponds to clustering: it enforces that groups of tokens arising from the same object are similar while preserving their alignment with the language space. To further guide the group formation, we propose a set of regularizations that allow the model to generalize across datasets and backbones. We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation. The results show that GEM not only outperforms other training-free open-vocabulary localization methods, but also achieves state-of-the-art results on the recently proposed OpenImagesV7 large-scale segmentation benchmark.
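To illustrate the claim that self-self attention behaves like clustering, the following is a minimal sketch under stated assumptions, not the GEM implementation. It treats each token as its own query and key, so the attention weights are a softmax over token-to-token similarities. The function name `self_self_attention`, the temperature, the number of iterations, the L2 normalization step, and the synthetic data are all illustrative choices that the abstract does not specify.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_self_attention(tokens, temperature=0.1, iterations=3):
    """Hypothetical self-self attention on an (n_tokens, dim) array.

    Each token attends to all tokens using itself as both query and key,
    so every token is pulled toward the tokens it is already similar to.
    Temperature and iteration count are assumptions for illustration.
    """
    x = tokens
    for _ in range(iterations):
        # Normalize so that dot products are cosine similarities.
        x = x / np.linalg.norm(x, axis=-1, keepdims=True)
        attn = softmax(x @ x.T / temperature, axis=-1)
        # Replace each token by a similarity-weighted average of all tokens.
        x = attn @ x
    return x


def mean_within_group_similarity(x, group_size):
    """Average cosine similarity inside two consecutive groups of tokens."""
    x = x / np.linalg.norm(x, axis=-1, keepdims=True)
    sim = x @ x.T
    first = sim[:group_size, :group_size].mean()
    second = sim[group_size:, group_size:].mean()
    return (first + second) / 2


# Synthetic stand-in for patch tokens from two objects: noisy copies of
# two orthogonal directions (made-up data, not real image features).
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
tokens = np.vstack([c + 0.2 * rng.standard_normal((5, 4)) for c in centers])

out = self_self_attention(tokens)
print("within-group similarity before:", mean_within_group_similarity(tokens, 5))
print("within-group similarity after: ", mean_within_group_similarity(out, 5))
```

In this toy setting, tokens drawn around the same direction move closer together after a few iterations, which is the grouping effect the abstract attributes to self-self attention. The sketch leaves out the learned projections, the regularizations, and the alignment with the text encoder.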