CLIP has enabled new and exciting joint vision-language applications, one of which is open-vocabulary segmentation, which can locate any segment given an arbitrary text query. In our research, we ask whether it is possible to discover semantic segments without any user guidance in the form of text queries or predefined classes, and to label them automatically using natural language. We propose a novel problem, zero-guidance segmentation, and the first baseline that leverages two pre-trained generalist models, DINO and CLIP, to solve this problem without any fine-tuning or segmentation dataset. The general idea is to first segment an image into small over-segments, encode them into CLIP's vision-language space, translate them into text labels, and merge semantically similar segments. The key challenge, however, is how to encode a visual segment into a segment-specific embedding that balances global and local context information, both of which are useful for recognition. Our main contribution is a novel attention-masking technique that balances the two contexts by analyzing the attention layers inside CLIP. We also introduce several metrics for evaluating this new task. With CLIP's innate knowledge, our method can precisely locate the Mona Lisa painting among a museum crowd. Project page: https://zero-guide-seg.github.io/.
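The final step of the pipeline described above, merging semantically similar segments, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `merge_similar_segments`, the cosine-similarity criterion, the `threshold` value, and the toy 2-D embeddings are all assumptions made for illustration. In the actual method the embeddings would come from CLIP, and the merging rule may differ.

```python
import numpy as np


def merge_similar_segments(embeddings, threshold=0.9):
    """Group segments whose embeddings have cosine similarity >= threshold.

    Hypothetical helper: links are transitive (union-find), so a chain of
    pairwise-similar segments ends up in one group. Returns one group id
    per segment, numbered in order of first appearance.
    """
    emb = np.asarray(embeddings, dtype=float)
    # Normalize rows so that the dot product equals cosine similarity.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T

    n = len(emb)
    parent = list(range(n))

    def find(i):
        # Find the root of i, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Union every pair of segments that is similar enough.
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] >= threshold:
                parent[find(j)] = find(i)

    # Relabel roots as consecutive group ids.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


# Toy stand-ins for segment embeddings (assumed values, not CLIP outputs).
toy_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.05, 1.0]]
groups = merge_similar_segments(toy_embeddings, threshold=0.9)
print(groups)
```

Here the first two and the last two toy vectors point in nearly the same direction, so they fall into two groups. The sketch compares every pair of segments regardless of spatial position; restricting merges to neighboring segments would be a separate design choice.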