This work aims to leverage pre-trained foundation models, such as contrastive language-image pre-training (CLIP) and the segment anything model (SAM), to address weakly supervised semantic segmentation (WSSS) using image-level labels. To this end, we propose a coarse-to-fine framework based on CLIP and SAM for generating high-quality segmentation seeds. Specifically, we construct an image classification task and a seed segmentation task, which are jointly performed by CLIP with frozen weights and two sets of learnable task-specific prompts. A SAM-based seeding (SAMS) module is designed and applied to each task to produce either coarse or fine seed maps. Moreover, we design a multi-label contrastive loss supervised by image-level labels and a CAM activation loss supervised by the generated coarse seed map. These losses are used to learn the prompts, which are the only parts of our framework that need to be learned. Once the prompts are learned, we input each image, along with the learned segmentation-specific prompts, into CLIP and the SAMS module to produce high-quality segmentation seeds. As in other two-stage WSSS methods, these seeds serve as pseudo labels to train an off-the-shelf segmentation network. Experiments show that our method achieves state-of-the-art performance on PASCAL VOC 2012 and competitive results on MS COCO 2014.
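The abstract does not give the form of the multi-label contrastive loss, so the following is only a minimal sketch of one plausible formulation, not the authors' definition. It assumes that a frozen encoder has already produced one image embedding and one text embedding per class prompt, and that each positive class is contrasted against all negative classes with a softmax over temperature-scaled cosine similarities. The function name `multilabel_contrastive_loss`, the `temperature` value, and the toy embeddings are all hypothetical.

```python
import math


def _normalize(vec):
    """Return vec scaled to unit L2 norm (unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def _cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    return sum(a * b for a, b in zip(_normalize(u), _normalize(v)))


def multilabel_contrastive_loss(image_emb, prompt_embs, labels, temperature=0.07):
    """Assumed loss form: each positive class is contrasted against all negatives.

    image_emb   -- embedding of one image (list of floats)
    prompt_embs -- one text embedding per class prompt (list of lists)
    labels      -- image-level labels, 1 if the class is present, else 0
    """
    if len(prompt_embs) != len(labels):
        raise ValueError("need exactly one label per class prompt")

    # Temperature-scaled cosine similarity between the image and each prompt.
    logits = [_cosine(image_emb, p) / temperature for p in prompt_embs]
    positives = [s for s, y in zip(logits, labels) if y]
    negatives = [s for s, y in zip(logits, labels) if not y]
    if not positives:
        raise ValueError("at least one positive label is required")

    total = 0.0
    for s_pos in positives:
        group = [s_pos] + negatives
        # Numerically stable log-sum-exp over the positive and all negatives.
        m = max(group)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in group))
        total += log_denominator - s_pos  # equals -log softmax of the positive
    return total / len(positives)


# Toy example: three classes, image embedding aligned with the class-0 prompt.
prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image = [2.0, 0.0, 0.0]
loss_matching = multilabel_contrastive_loss(image, prompts, [1, 0, 0])
loss_mismatched = multilabel_contrastive_loss(image, prompts, [0, 1, 0])
```

In this toy setup the loss is small when the labelled class is the one whose prompt embedding matches the image, and large otherwise. In the framework described above, the gradient of such a loss would reach only the learnable prompts, since the CLIP weights stay frozen.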