Modern supervised semantic segmentation methods are usually fine-tuned from supervised or self-supervised models pre-trained on ImageNet. Recent work shows that transferring knowledge from CLIP to semantic segmentation via prompt learning can achieve promising performance. The performance boost comes from feature enhancement with multimodal alignment, i.e., the dot product between vision and text embeddings. However, how to improve the multimodal alignment for better transfer performance in dense tasks remains underexplored. In this work, we focus on improving the quality of vision-text alignment from two aspects, prompt design and loss function, and present ICPC, an instance-conditioned prompting with contrastive learning framework. First, we show that, compared with static prompt designs, dynamic prompting conditioned on image content uses the text encoder more efficiently for complex dense tasks. Second, we propose an align-guided contrastive loss to refine the alignment of vision and text embeddings. We further propose a lightweight multi-scale alignment for better performance. Extensive experiments on three large-scale datasets (ADE20K, COCO-Stuff10k, and ADE20K-Full) demonstrate that ICPC brings consistent improvements across diverse backbones. Taking ResNet-50 as an example, ICPC outperforms the state-of-the-art counterpart by 1.71%, 1.05%, and 1.41% mIoU on the three datasets, respectively.
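To make the notion of multimodal alignment concrete, the following is a minimal sketch of a pixel-text score map computed as the dot product between per-pixel vision embeddings and per-class text embeddings. It is not the ICPC implementation: the function names `l2_normalize` and `score_map` are hypothetical, the L2 normalization (which turns the dot product into a cosine similarity, as is common with CLIP-style embeddings) is an assumption, and plain Python lists stand in for the tensors a real model would use.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumed preprocessing, CLIP-style)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def score_map(pixel_embeddings, text_embeddings):
    """Dot product between every pixel embedding and every class text embedding.

    pixel_embeddings: nested list of shape H x W x C
    text_embeddings:  nested list of shape K x C (one row per class)
    returns:          nested list of shape H x W x K
    """
    text = [l2_normalize(t) for t in text_embeddings]
    scores = []
    for row in pixel_embeddings:
        score_row = []
        for pixel in row:
            p = l2_normalize(pixel)
            # One alignment score per class for this pixel.
            score_row.append([sum(a * b for a, b in zip(p, t)) for t in text])
        scores.append(score_row)
    return scores


def predict_classes(scores):
    """Pick the best-aligned class index for each pixel."""
    return [[max(range(len(s)), key=s.__getitem__) for s in row] for row in scores]
```

Under these assumptions, each pixel receives one score per class, and the highest-scoring class can serve as a coarse prediction or as an extra feature channel for a downstream decoder.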