Vision-language foundation models have shown remarkable performance in various zero-shot settings such as image retrieval, classification, and captioning. So far, however, these models seem to fall behind in zero-shot localization of referential expressions and objects in images, and as a result they need to be fine-tuned for this task. In this paper, we show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning. To leverage these capabilities, we propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path. We show that self-self attention corresponds to clustering: it enforces that groups of tokens arising from the same object are similar while preserving their alignment with the language space. To further guide the group formation, we propose a set of regularizations that allows the model to generalize across datasets and backbones. We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation. The results show that GEM not only outperforms other training-free open-vocabulary localization methods, but also achieves state-of-the-art results on the recently proposed OpenImagesV7 large-scale segmentation benchmark.
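To make the clustering intuition behind self-self attention concrete, the following is a minimal NumPy sketch and not the GEM implementation. It assumes a single projection matrix `proj` applied to the tokens on both sides of the attention product (in place of separate query and key projections) and the standard scaling by the square root of the projected dimension. The function name, argument names, and the scaling are illustrative assumptions; the regularizations mentioned above are not modeled.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_self_attention(tokens, proj, values):
    """Illustrative self-self attention (hypothetical helper, not the paper's code).

    tokens: (n, d) token features
    proj:   (d, d_k) one projection used for BOTH sides of the attention product
    values: (n, d_v) features to be aggregated
    Returns the aggregated features (n, d_v) and the attention matrix (n, n).
    """
    p = tokens @ proj                        # project every token once
    scores = p @ p.T / np.sqrt(p.shape[-1])  # similarity of each token to all tokens
    attn = softmax(scores, axis=-1)          # each row sums to 1
    return attn @ values, attn               # similar tokens are averaged together


# Toy input: two groups of two near-identical tokens each.
tokens = np.array([[5.0, 0.0], [5.0, 0.1], [0.0, 5.0], [0.1, 5.0]])
out, attn = self_self_attention(tokens, np.eye(2), tokens)
print(np.round(attn, 3))
```

Because the score matrix is built from a single projection, tokens with similar projected features attend mostly to one another, so their outputs are pulled toward a shared group average. This is the sense in which the operation behaves like a clustering step.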