Semantic segmentation is one of the most challenging vision tasks, usually requiring large amounts of training data with expensive pixel-level annotations. With the success of foundation models, and especially vision-language models, recent works attempt zero-shot semantic segmentation but still require either large-scale training or additional image- or pixel-level annotations. In this work, we generate free annotations for any semantic segmentation dataset using existing foundation models. We use CLIP to detect objects and SAM to generate high-quality object masks. Next, we build a lightweight module on top of a self-supervised vision encoder, DINOv2, to align its patch features with a pretrained text encoder for zero-shot semantic segmentation. Our approach can bring language-based semantics to any pretrained vision encoder with minimal training. Our module is lightweight, uses foundation models as the sole source of supervision, and generalizes well from little training data without annotations.
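The inference step implied by aligning patch features with a text encoder can be sketched as follows. This is a minimal illustration, not the module described above: the single linear projection `proj`, the function names, and the random stand-in arrays are all assumptions. In practice the patch features would come from the vision encoder and the class embeddings from the pretrained text encoder.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def zero_shot_segment(patch_feats, text_embeds, proj):
    """Assign each image patch the class whose text embedding is most similar.

    patch_feats: (H, W, Dv) patch features from a vision encoder.
    text_embeds: (C, Dt) one embedding per class name from a text encoder.
    proj:        (Dv, Dt) hypothetical linear alignment layer (an assumption;
                 it stands in for the lightweight module).
    Returns an (H, W) integer array of class indices.
    """
    h, w, dv = patch_feats.shape
    # Map patch features into the text embedding space.
    aligned = patch_feats.reshape(h * w, dv) @ proj
    # Cosine similarity between every patch and every class embedding.
    sims = l2_normalize(aligned) @ l2_normalize(text_embeds).T
    # Pick the best-matching class per patch and restore the patch grid.
    return sims.argmax(axis=-1).reshape(h, w)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patches = rng.normal(size=(16, 16, 384))  # stand-in for encoder output
    classes = rng.normal(size=(5, 512))       # stand-in for text embeddings
    proj = rng.normal(size=(384, 512))        # untrained, illustration only
    print(zero_shot_segment(patches, classes, proj).shape)
```

The returned patch-level label map would still need to be upsampled to the image resolution to obtain a pixel-level segmentation.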