It is widely agreed that open-vocabulary approaches outperform classical closed-set training solutions for recognizing unseen objects in semantic segmentation. Existing open-vocabulary approaches leverage vision-language models, such as CLIP, to align visual features with the rich semantic features acquired through pre-training on large-scale vision-language datasets. However, the text prompts employed in these methods are short phrases based on fixed templates, which fail to capture comprehensive object attributes. Moreover, while the CLIP model excels at extracting image-level features, it is less effective at pixel-level representation, which is crucial for semantic segmentation. In this work, we propose to alleviate these issues by leveraging multiple large-scale models to enhance the alignment between fine-grained visual features and enriched linguistic features. Specifically, our method employs large language models (LLMs) to generate enriched language prompts with diverse visual attributes for each category, including color, shape/size, and texture/material. For enhanced visual feature extraction, the SAM model is adopted as a supplement to the CLIP visual encoder through a proposed learnable weighted fusion strategy. Built upon these techniques, our method, termed LMSeg, achieves state-of-the-art performance across all major open-vocabulary segmentation benchmarks. The code will be made available soon.
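The learnable weighted fusion described above can be illustrated with a minimal sketch. This is a hypothetical implementation, not the paper's actual module: the projection dimensions, the use of a single sigmoid-gated scalar weight, and the module name `WeightedFusion` are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Hypothetical sketch of a learnable weighted fusion of CLIP and SAM
    visual features. The real LMSeg design may differ."""

    def __init__(self, clip_dim: int, sam_dim: int, out_dim: int):
        super().__init__()
        # Project both feature streams into a shared embedding space.
        self.proj_clip = nn.Linear(clip_dim, out_dim)
        self.proj_sam = nn.Linear(sam_dim, out_dim)
        # Learnable scalar controlling the mix; optimized end-to-end.
        self.alpha = nn.Parameter(torch.tensor(0.0))

    def forward(self, f_clip: torch.Tensor, f_sam: torch.Tensor) -> torch.Tensor:
        w = torch.sigmoid(self.alpha)  # keep the fusion weight in (0, 1)
        return w * self.proj_clip(f_clip) + (1.0 - w) * self.proj_sam(f_sam)

fusion = WeightedFusion(clip_dim=512, sam_dim=256, out_dim=128)
fused = fusion(torch.randn(4, 512), torch.randn(4, 256))
print(fused.shape)  # torch.Size([4, 128])
```

The sigmoid keeps the mixing coefficient bounded, so the fused representation always remains a convex combination of the two projected streams regardless of how `alpha` drifts during training.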