Vision-Language Models (VLMs) have emerged as promising tools for open-ended image understanding tasks, including open-vocabulary segmentation. Yet applying VLMs directly to segmentation is non-trivial, since they are trained on image-text pairs and therefore lack pixel-level granularity. Recent works have made progress in bridging this gap, often by leveraging the shared image-text space in which both the image and a provided text prompt are represented. In this paper, we push the capabilities of VLMs further and tackle open-vocabulary segmentation without any textual input. To this end, we propose a novel Self-Guided Semantic Segmentation (Self-Seg) framework. Self-Seg automatically detects relevant class names from clustered BLIP embeddings and uses them for accurate semantic segmentation. In addition, we propose an LLM-based Open-Vocabulary Evaluator (LOVE) to assess predicted open-vocabulary class names. We achieve state-of-the-art results on Pascal VOC, ADE20K and Cityscapes for open-vocabulary segmentation without given class names, as well as performance competitive with methods that are given class names. All code and data will be released.