Contrastive language-image pre-training (e.g., CLIP) has recently demonstrated promising results on various downstream tasks. By learning from large-scale text-image data, the pre-trained model captures rich visual concepts. However, transferring this learned visual knowledge to open-vocabulary semantic segmentation remains under-explored. In this paper, we propose SegCLIP, a CLIP-based model for open-vocabulary segmentation in an annotation-free manner. SegCLIP builds on ViT; its main idea is to gather patches into semantic regions using learnable centers, trained on text-image pairs. The gathering operation dynamically captures semantic groups, which are used to generate the final segmentation results. We further propose a reconstruction loss on masked patches and a superpixel-based KL loss with pseudo-labels to enhance the visual representation. Experimental results show that our model achieves comparable or superior segmentation accuracy compared with baselines on PASCAL VOC 2012 (+0.3% mIoU), PASCAL Context (+2.3% mIoU), and COCO (+2.2% mIoU). We release the code at https://github.com/ArrowLuo/SegCLIP.
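To make the gathering idea concrete, the following is a minimal sketch of assigning patch features to a small set of centers and pooling them into region features. It is an illustration under stated assumptions, not the SegCLIP implementation: the function name `gather_patches`, the dot-product similarity, the softmax weighting, and the toy 2-D features are all hypothetical choices, and in the actual model the centers would be learned during training rather than fixed by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def gather_patches(patches, centers):
    """Hypothetical gathering step (assumption, not the paper's code).

    Returns a hard patch-to-center assignment and one pooled
    feature vector per center.
    """
    # Soft assignment: each patch gets a distribution over centers.
    weights = [softmax([dot(p, c) for c in centers]) for p in patches]
    # Hard assignment: the most similar center for each patch.
    assignment = [max(range(len(centers)), key=lambda k: w[k]) for w in weights]
    # Region features: assignment-weighted mean of the patch features.
    dim = len(patches[0])
    regions = []
    for k in range(len(centers)):
        mass = sum(w[k] for w in weights)
        regions.append([
            sum(w[k] * p[d] for w, p in zip(weights, patches)) / mass
            for d in range(dim)
        ])
    return assignment, regions


# Toy example: four patch features and two hand-picked centers.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centers = [[1.0, 0.0], [0.0, 1.0]]
assignment, regions = gather_patches(patches, centers)
print(assignment)  # [0, 0, 1, 1]
```

Reshaping the hard assignment back onto the patch grid gives a coarse region map, which is the sense in which semantic groups can be turned into a segmentation result; how regions are matched to text labels is outside this sketch.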