Existing open-vocabulary image segmentation methods require a fine-tuning step on mask labels and/or image-text datasets. Mask labels are labor-intensive, which limits the number of categories in segmentation datasets. Consequently, the vocabulary capacity of pre-trained VLMs is severely reduced after fine-tuning. However, without fine-tuning, VLMs trained under weak image-text supervision tend to make suboptimal mask predictions. To alleviate these issues, we introduce a novel recurrent framework that progressively filters out irrelevant texts and enhances mask quality without any training effort. The recurrent unit is a two-stage segmenter built upon a frozen VLM, so our model retains the VLM's broad vocabulary space while gaining segmentation ability. Experiments show that our method outperforms not only its training-free counterparts but also methods fine-tuned with millions of data samples, setting new state-of-the-art records for both zero-shot semantic and referring segmentation. Concretely, we improve the current record by 28.8, 16.0, and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context, respectively.
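The recurrent process described above can be illustrated with a minimal sketch. Everything here is assumed for illustration: `DummyFrozenVLM`, `propose_masks`, `score`, and `recurrent_segment` are hypothetical stand-ins, not the paper's actual components; the real two-stage segmenter operates on a frozen VLM's features rather than fixed scores.

```python
from dataclasses import dataclass

@dataclass
class DummyFrozenVLM:
    """Hypothetical stand-in for a frozen VLM; returns canned
    text-mask relevance scores so the loop runs end to end."""
    relevance: dict  # text query -> relevance score in [0, 1]

    def propose_masks(self, image, texts):
        # Stage 1 (sketch): one placeholder mask per text query.
        return {t: f"mask_for_{t}" for t in texts}

    def score(self, image, texts, masks):
        # Stage 2 (sketch): rate how well each mask matches its text.
        return [self.relevance[t] for t in texts]

def recurrent_segment(image, texts, vlm, num_steps=3, score_thresh=0.5):
    """Progressively drop irrelevant texts, then re-segment with survivors."""
    active = list(texts)
    masks = {}
    for _ in range(num_steps):
        masks = vlm.propose_masks(image, active)   # stage 1: coarse masks
        scores = vlm.score(image, active, masks)   # stage 2: text-mask scoring
        keep = [t for t, s in zip(active, scores) if s >= score_thresh]
        if keep == active:
            break  # converged: the remaining vocabulary stopped shrinking
        active = keep  # filter out irrelevant texts, then recur
    return active, masks

vlm = DummyFrozenVLM(relevance={"cat": 0.9, "dog": 0.2, "sofa": 0.7})
texts, masks = recurrent_segment("image", ["cat", "dog", "sofa"], vlm)
```

The key point the sketch captures is that the VLM stays frozen throughout: each pass reuses the same model to both segment and filter, so the vocabulary narrows only by evidence from the image, never by fine-tuning.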