CLIP has enabled new joint vision-language applications, one of which is open-vocabulary segmentation, which can locate any segment given an arbitrary text query. In this work, we ask whether it is possible to discover semantic segments without any user guidance in the form of text queries or predefined classes, and to label them automatically using natural language. We propose a novel problem, zero-guidance segmentation, and the first baseline, which leverages two pre-trained generalist models, DINO and CLIP, to solve this problem without any fine-tuning or segmentation dataset. The general idea is to first segment an image into small over-segments, encode them into CLIP's joint vision-language space, translate them into text labels, and merge semantically similar segments. The key challenge, however, is how to encode a visual segment into a segment-specific embedding that balances global and local context information, both of which are useful for recognition. Our main contribution is a novel attention-masking technique that balances the two contexts by analyzing the attention layers inside CLIP. We also introduce several metrics for evaluating this new task. With CLIP's innate knowledge, our method can precisely locate the Mona Lisa painting among a museum crowd. Project page: https://zero-guide-seg.github.io/.
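The final step of the pipeline, merging semantically similar segments, can be illustrated with a minimal sketch. This is not the method's actual merging procedure, which the abstract does not specify. It assumes that each over-segment already has a non-zero embedding vector, and it groups segments whose cosine similarity meets a threshold, using union-find so that grouping is transitive. The function names `cosine` and `merge_segments` and the `threshold` value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def merge_segments(embeddings, threshold=0.9):
    """Group segments whose embeddings are similar (illustrative only).

    Assumption: a fixed cosine-similarity threshold decides whether two
    segments belong together; the real criterion may differ.
    Returns one group id per segment, numbered by first appearance.
    """
    n = len(embeddings)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    group_ids = {}
    return [group_ids.setdefault(find(i), len(group_ids)) for i in range(n)]
```

In this sketch, two nearly parallel embeddings end up in one group while an orthogonal one stays separate. Because grouping is transitive, a chain of pairwise-similar segments collapses into a single group even when its endpoints are dissimilar, which may or may not be desirable in practice.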