Semantic segmentation is one of the most challenging vision tasks, usually requiring large amounts of training data with expensive pixel-level annotations. Building on the success of foundation models, and vision-language models in particular, recent works attempt zero-shot semantic segmentation but still require either large-scale training or additional image-level or pixel-level annotations. In this work, we build a lightweight module on top of a self-supervised pre-trained vision encoder to align its patch features with a pre-trained text encoder. Importantly, we generate free annotations for any semantic segmentation dataset using existing foundation models, so our alignment module is trained at no annotation cost: we use CLIP to detect objects and SAM to generate high-quality object masks. Our approach can bring language-based semantics to any pre-trained vision encoder with minimal training. Our module is lightweight, uses foundation models as its sole source of supervision, and generalizes well from a small amount of training data without any human annotation.
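To make the alignment idea concrete, the following is a minimal sketch of how patch features could be labeled once they are mapped into the text embedding space. It is an illustration under stated assumptions, not the method's actual implementation: the alignment module is assumed here to be a single linear projection, and the names `project`, `segment_patches`, `weight`, and the toy vectors are all hypothetical. In practice the patch features would come from the frozen vision encoder and the class embeddings from the pre-trained text encoder.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def project(patch, weight):
    """Hypothetical alignment module: one linear map into the text space.

    `weight` holds one row per output (text-space) dimension.
    """
    return [sum(w * x for w, x in zip(row, patch)) for row in weight]


def segment_patches(patches, weight, text_embeddings):
    """Give each patch the index of the class whose text embedding is most
    similar (by cosine similarity) to the projected patch feature."""
    texts = [normalize(t) for t in text_embeddings]
    labels = []
    for patch in patches:
        z = normalize(project(patch, weight))
        sims = [sum(a * b for a, b in zip(z, t)) for t in texts]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels


# Toy data (assumed values): 3-d patch features, 2-d text space, 2 classes.
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]
patches = [[2.0, 0.1, 5.0],
           [0.2, 3.0, -1.0],
           [1.0, 0.9, 0.0]]
text_embeddings = [[1.0, 0.0],   # embedding standing in for class 0
                   [0.0, 1.0]]   # embedding standing in for class 1

labels = segment_patches(patches, weight, text_embeddings)
print(labels)  # [0, 1, 0]
```

In this sketch only `weight` would be learned; training it against pseudo-labels derived from CLIP detections and SAM masks is what would replace human annotation, and changing the class list at inference time only requires new text embeddings.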