Open-ended image understanding tasks have gained significant attention from the research community, particularly with the emergence of Vision-Language Models. Open-Vocabulary Segmentation (OVS) methods can perform semantic segmentation without relying on a fixed vocabulary, in some cases without any training or fine-tuning. However, OVS methods typically require users to specify the vocabulary for the task or dataset at hand. In this paper, we introduce \textit{Auto-Vocabulary Semantic Segmentation (AVS)}, advancing open-ended image understanding by eliminating the need to predefine object categories for segmentation. Our approach, \ours, is a framework that autonomously identifies relevant class names using enhanced BLIP embeddings, which are then used for segmentation. Because open-ended object category predictions cannot be directly compared with a fixed ground truth, we develop a Large Language Model-based Auto-Vocabulary Evaluator (LAVE) to efficiently evaluate the automatically generated class names and their corresponding segments. Our method sets new benchmarks for AVS on PASCAL VOC, PASCAL Context, ADE20K, and Cityscapes, and performs competitively with OVS methods that require specified class names.