Open-vocabulary segmentation aims to identify and segment specific regions and objects based on text descriptions. A common solution is to leverage powerful vision-language models (VLMs), such as CLIP, to bridge the gap between visual and textual information. However, VLMs are typically pretrained for image-level vision-text alignment, focusing on global semantic features. In contrast, segmentation tasks require fine-grained pixel-level alignment and detailed category boundary information, which VLMs alone cannot provide; as a result, information extracted directly from VLMs does not meet the requirements of segmentation tasks. To address this limitation, we propose FGAseg, a model designed for fine-grained pixel-text alignment and category boundary supplementation. The core of FGAseg is a Pixel-Level Alignment module that employs a cross-modal attention mechanism and a text-pixel alignment loss to refine the coarse-grained alignment from CLIP, achieving finer-grained pixel-text semantic alignment. Additionally, to enrich category boundary information, we introduce alignment matrices as optimizable pseudo-masks during forward propagation and propose a Category Information Supplementation module. These pseudo-masks, derived from cosine and convolutional similarity, provide essential global and local boundary information between different categories. By combining these two strategies, FGAseg effectively enhances pixel-level alignment and category boundary information, addressing key challenges in open-vocabulary segmentation. Extensive experiments demonstrate that FGAseg outperforms existing methods on open-vocabulary semantic segmentation benchmarks.
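As one concrete illustration of the cosine-similarity pseudo-mask idea described above (a minimal sketch, not the paper's exact implementation; the function name and shapes are illustrative assumptions), an alignment matrix between pixel embeddings and category text embeddings can be computed as:

```python
import numpy as np

def cosine_alignment(pixel_feats: np.ndarray, text_feats: np.ndarray,
                     eps: float = 1e-8) -> np.ndarray:
    """Hypothetical sketch of a cosine-similarity alignment matrix.

    pixel_feats: (H*W, D) per-pixel embeddings (e.g. from a CLIP image encoder)
    text_feats:  (C, D) category text embeddings (e.g. from a CLIP text encoder)
    Returns an (H*W, C) matrix of cosine similarities, which can serve as
    pseudo-mask logits over the C categories for each pixel.
    """
    # L2-normalize both sets of embeddings so the dot product is cosine similarity
    p = pixel_feats / (np.linalg.norm(pixel_feats, axis=1, keepdims=True) + eps)
    t = text_feats / (np.linalg.norm(text_feats, axis=1, keepdims=True) + eps)
    return p @ t.T
```

A "convolutional similarity" variant would analogously slide the text embeddings as 1×1 (or larger) kernels over the spatial feature map, capturing local boundary cues in addition to this global similarity.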