Pre-trained vision-language models, e.g., CLIP, have been increasingly used to address the challenging Open-Vocabulary Segmentation (OVS) task, benefiting from their well-aligned vision-text embedding space. Typical solutions either freeze CLIP during training to unilaterally preserve its zero-shot capability, or fine-tune the CLIP vision encoder to achieve perceptual sensitivity to local regions. However, few of them incorporate vision-text collaborative optimization. Motivated by this, we propose Content-Dependent Transfer, which adaptively enhances each text embedding by interacting with the input image, offering a parameter-efficient way to optimize the text representation. In addition, we introduce a Representation Compensation strategy that reviews the original CLIP-V representation as compensation, preserving the zero-shot capability of CLIP. In this way, the vision and text representations of CLIP are optimized collaboratively, enhancing the alignment of the vision-text feature space. To the best of our knowledge, we are the first to establish a collaborative vision-text optimization mechanism within the OVS field. Extensive experiments demonstrate that our method achieves superior performance on popular OVS benchmarks. In open-vocabulary semantic segmentation, our method outperforms the previous state-of-the-art approaches by +0.5, +2.3, +3.4, +0.4, and +1.1 mIoU on A-847, A-150, PC-459, PC-59, and PAS-20, respectively. Furthermore, in the panoptic setting on ADE20K, we achieve 27.1 PQ, 73.5 SQ, and 32.9 RQ. Code will be available at https://github.com/jiaosiyu1999/MAFT-Plus.git.
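The abstract does not specify the internal mechanism of the two components, so the following is only a minimal illustrative sketch, not the paper's implementation. It assumes Content-Dependent Transfer can be read as a residual cross-attention update of the text embeddings against image features, and Representation Compensation as a convex blend of fine-tuned and frozen CLIP-V features; all function names, shapes, and the `alpha` parameter are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def content_dependent_transfer(text_emb, img_feats, W_q, W_k, W_v):
    """Hypothetical sketch: each text embedding attends to image features
    and receives a residual, content-dependent update.

    text_emb:  (C, d) one embedding per candidate class name
    img_feats: (N, d) patch features from the CLIP vision encoder
    W_q/W_k/W_v: (d, d) learned projections (the only new parameters,
                 hence parameter-efficient relative to tuning CLIP-T)
    """
    q = text_emb @ W_q
    k = img_feats @ W_k
    v = img_feats @ W_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (C, N)
    return text_emb + attn @ v  # residual update conditioned on the image


def representation_compensation(finetuned_feats, frozen_feats, alpha=0.5):
    """Hypothetical sketch: blend fine-tuned vision features with the
    original frozen CLIP-V features to retain zero-shot behavior."""
    return alpha * finetuned_feats + (1.0 - alpha) * frozen_feats
```

With `alpha=1.0` the blend reduces to the fine-tuned features and with `alpha=0.0` to the frozen CLIP-V features, so the compensation term interpolates between perceptual sensitivity and zero-shot preservation.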