The emergence of CLIP has opened the way for open-world image perception. The model's zero-shot classification capabilities are impressive, but they are harder to exploit for dense tasks such as image segmentation. Several methods have proposed modifications and learning schemes to produce dense output. In this work, we instead propose an open-vocabulary semantic segmentation method, dubbed CLIP-DIY, which requires no additional training or annotations and leverages existing unsupervised object localization approaches. In particular, CLIP-DIY is a multi-scale approach that directly exploits CLIP's classification abilities on patches of different sizes and aggregates the decisions into a single map. We further guide the segmentation with foreground/background scores obtained from unsupervised object localization methods. With our method, we obtain state-of-the-art zero-shot semantic segmentation results on PASCAL VOC and perform on par with the best methods on COCO.
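To make the multi-scale idea concrete, the following is a minimal sketch and not the CLIP-DIY implementation. It assumes a non-overlapping tiling at each patch size, simple averaging of patch scores across scales, and a hard threshold on the foreground score; none of these choices is specified in the abstract. The names `multiscale_segmentation`, `classify_patch`, `patch_sizes`, and `fg_threshold` are hypothetical. `classify_patch` stands in for CLIP zero-shot classification of a patch, and `foreground` stands in for the output of an unsupervised object localization method.

```python
import numpy as np


def multiscale_segmentation(image, classify_patch, num_classes,
                            patch_sizes=(64, 128), foreground=None,
                            fg_threshold=0.5):
    """Aggregate per-patch class scores over several patch sizes into one label map.

    classify_patch: hypothetical callable mapping an (h, w, 3) patch to a
                    (num_classes,) score vector (a stand-in for CLIP).
    foreground:     optional (H, W) array in [0, 1] (a stand-in for an
                    unsupervised object localization output).
    Returns an (H, W) integer array; -1 marks background pixels.
    """
    height, width = image.shape[:2]
    score_sum = np.zeros((height, width, num_classes), dtype=np.float64)
    coverage = np.zeros((height, width, 1), dtype=np.float64)

    for size in patch_sizes:
        # Assumption: non-overlapping tiling; edge patches may be smaller.
        for top in range(0, height, size):
            for left in range(0, width, size):
                patch = image[top:top + size, left:left + size]
                scores = np.asarray(classify_patch(patch), dtype=np.float64)
                # Every pixel of the patch receives the patch-level scores.
                score_sum[top:top + size, left:left + size] += scores
                coverage[top:top + size, left:left + size] += 1.0

    # Assumption: scales are combined by a plain average of their scores.
    mean_scores = score_sum / coverage
    labels = mean_scores.argmax(axis=-1)

    if foreground is not None:
        # Assumption: pixels with a low foreground score become background.
        labels = np.where(foreground >= fg_threshold, labels, -1)
    return labels
```

In practice, `classify_patch` would wrap a CLIP image encoder and compare the patch embedding against text embeddings of the class names; here any callable with the same shape contract works, which keeps the sketch independent of a particular CLIP library.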