In this paper, we tackle an emerging computer vision task, open-vocabulary universal image segmentation, which aims to perform semantic, instance, and panoptic segmentation (background semantic labeling plus foreground instance segmentation) for arbitrary categories given as text descriptions at inference time. We first build a baseline method by directly adopting pre-trained CLIP models without finetuning or distillation. We then develop MaskCLIP, a Transformer-based approach built around a MaskCLIP Visual Encoder, an encoder-only module that seamlessly integrates mask tokens with a pre-trained ViT CLIP model for semantic/instance segmentation and class prediction. MaskCLIP learns to use pre-trained partial/dense CLIP features efficiently and effectively within the MaskCLIP Visual Encoder, which avoids the time-consuming student-teacher training process. MaskCLIP outperforms previous methods for semantic/instance/panoptic segmentation on the ADE20K and PASCAL datasets. We show qualitative illustrations of MaskCLIP with online custom categories. Project website: https://maskclip.github.io.
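To make the open-vocabulary classification step concrete, the following is a minimal sketch of how mask regions could be labeled with frozen CLIP-style embeddings. It is an illustration under stated assumptions, not the implementation described in this paper: the function name `classify_masks`, the mask-average pooling, the `temperature` value, and the toy inputs are all hypothetical, and the dense image features and text embeddings are assumed to come from a pre-trained CLIP model that is not loaded here.

```python
import numpy as np

def classify_masks(dense_feats, masks, text_embeds, temperature=0.01):
    """Assign one text category to each mask (hypothetical helper).

    dense_feats: (H, W, D) per-pixel image features, assumed to come from a frozen CLIP image encoder.
    masks:       (N, H, W) binary class-agnostic mask proposals.
    text_embeds: (K, D) embeddings of the K category names, assumed to come from the CLIP text encoder.
    Returns (labels, probs) with shapes (N,) and (N, K).
    """
    masks = masks.astype(dense_feats.dtype)
    # Mask-average pooling: one D-dimensional feature per mask.
    area = np.maximum(masks.sum(axis=(1, 2)), 1.0)
    pooled = np.einsum("nhw,hwd->nd", masks, dense_feats) / area[:, None]
    # Cosine similarity between pooled mask features and text embeddings.
    pooled = pooled / (np.linalg.norm(pooled, axis=1, keepdims=True) + 1e-8)
    text = text_embeds / (np.linalg.norm(text_embeds, axis=1, keepdims=True) + 1e-8)
    logits = pooled @ text.T / temperature
    # Numerically stable softmax over the K categories.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs

# Toy inputs standing in for real CLIP outputs (assumption: D = 3, K = 3).
H, W, D = 4, 4, 3
dense_feats = np.zeros((H, W, D))
dense_feats[:, :2] = [1.0, 0.0, 0.0]   # left half resembles category 0
dense_feats[:, 2:] = [0.0, 1.0, 0.0]   # right half resembles category 1
masks = np.zeros((2, H, W), dtype=bool)
masks[0, :, :2] = True                  # mask over the left half
masks[1, :, 2:] = True                  # mask over the right half
text_embeds = np.eye(3)                 # one embedding per category name
labels, probs = classify_masks(dense_feats, masks, text_embeds)
```

Because the category set enters only through `text_embeds`, new categories can be supplied at inference time by embedding new text, which is the property the open-vocabulary setting relies on.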