Existing semantic segmentation approaches are often limited by costly pixel-wise annotations and predefined classes. In this work, we present CLIP-S$^4$, which leverages self-supervised pixel representation learning and vision-language models to enable various semantic segmentation tasks (e.g., unsupervised, transfer learning, and language-driven segmentation) without any human annotations or unknown class information. We first learn pixel embeddings with pixel-segment contrastive learning from different augmented views of images. To further improve the pixel embeddings and enable language-driven semantic segmentation, we design two types of consistency guided by vision-language models: 1) embedding consistency, which aligns our pixel embeddings to the joint feature space of a pre-trained vision-language model, CLIP; and 2) semantic consistency, which forces our model to make the same predictions as CLIP over a set of carefully designed target classes with both known and unknown prototypes. Thus, CLIP-S$^4$ enables a new task of class-free semantic segmentation, in which no unknown class information is needed during training. As a result, our approach shows consistent and substantial performance improvements on four popular benchmarks compared with state-of-the-art unsupervised and language-driven semantic segmentation methods. More importantly, our method outperforms these methods on unknown class recognition by a large margin.
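The two consistency terms described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the cosine-distance form of embedding consistency, the hard pseudo-labels taken from CLIP, the temperature value, and the random stand-in data are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def embedding_consistency_loss(segment_emb, clip_emb):
    """Mean cosine distance between our segment embeddings and CLIP's.

    Assumption: alignment is measured with cosine distance.
    """
    s = l2_normalize(segment_emb)
    c = l2_normalize(clip_emb)
    return float(np.mean(1.0 - np.sum(s * c, axis=-1)))


def class_probs(emb, prototypes, temperature=0.1):
    """Softmax over cosine similarities to the class prototypes."""
    logits = l2_normalize(emb) @ l2_normalize(prototypes).T / temperature
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)


def semantic_consistency_loss(segment_emb, clip_emb, prototypes, temperature=0.1):
    """Cross-entropy of our predictions against CLIP's hard pseudo-labels.

    Assumption: CLIP's prediction is reduced to an argmax label per segment.
    """
    target = class_probs(clip_emb, prototypes, temperature).argmax(axis=-1)
    pred = class_probs(segment_emb, prototypes, temperature)
    picked = pred[np.arange(len(target)), target]
    return float(-np.mean(np.log(picked + 1e-12)))


# Hypothetical stand-in data: 5 segments, 16-dimensional embeddings.
rng = np.random.default_rng(0)
clip_emb = rng.normal(size=(5, 16))        # CLIP embeddings of the segments
segment_emb = rng.normal(size=(5, 16))     # our (untrained) segment embeddings
known = rng.normal(size=(3, 16))           # stand-ins for known-class text prototypes
unknown = rng.normal(size=(2, 16))         # stand-ins for unknown-class prototypes
prototypes = np.concatenate([known, unknown], axis=0)

print("embedding consistency:", embedding_consistency_loss(segment_emb, clip_emb))
print("semantic consistency:", semantic_consistency_loss(segment_emb, clip_emb, prototypes))
```

In this sketch both terms shrink as the segment embeddings move toward CLIP's: the first directly in feature space, the second through agreement on the predicted class over the combined known and unknown prototypes.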