Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks. Source code and models are publicly available at: https://lorebianchi98.github.io/Talk2DINO/.
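The core mechanism described above — projecting CLIP textual embeddings into the DINOv2 patch-feature space through a learned mapping, then scoring patches by similarity — can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the MLP mapper, the feature dimensions, and the random stand-in tensors (in place of real CLIP and DINOv2 outputs) are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextToPatchAligner(nn.Module):
    """Hypothetical sketch of the learned mapping: projects a CLIP
    text embedding into the DINOv2 patch-feature space, then scores
    each visual patch by cosine similarity to yield a coarse
    segmentation map. The 2-layer MLP and dimensions are illustrative
    assumptions; both backbones stay frozen, only the mapper trains."""

    def __init__(self, clip_dim: int = 512, dino_dim: int = 768, hidden: int = 1024):
        super().__init__()
        self.mapper = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, dino_dim),
        )

    def forward(self, text_emb: torch.Tensor, patch_feats: torch.Tensor) -> torch.Tensor:
        # text_emb: (B, clip_dim) CLIP text features
        # patch_feats: (B, N, dino_dim) DINOv2 patch features
        mapped = self.mapper(text_emb)                          # (B, dino_dim)
        sim = F.cosine_similarity(patch_feats, mapped.unsqueeze(1), dim=-1)
        return sim                                              # (B, N) patch-text scores


# Stand-in tensors simulating frozen-backbone outputs (16x16 patch grid assumed).
B, H, W = 2, 16, 16
aligner = TextToPatchAligner()
text_emb = torch.randn(B, 512)            # placeholder for CLIP text embeddings
patch_feats = torch.randn(B, H * W, 768)  # placeholder for DINOv2 patch features
sim = aligner(text_emb, patch_feats)
seg_map = sim.view(B, H, W)               # reshape scores into a 2D map
print(seg_map.shape)  # torch.Size([2, 16, 16])
```

At inference, thresholding or arg-maxing such per-concept similarity maps over a set of free-form text prompts would yield the open-vocabulary segmentation; the paper additionally uses DINOv2 attention maps at training time to weight which patches align to the text, a step omitted from this sketch.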