The emergence of CLIP has opened the way for open-world image perception. The model's zero-shot classification capabilities are impressive, but they are hard to apply directly to dense tasks such as image segmentation. Several methods have proposed modifications and learning schemes to produce dense output. In this work, we instead propose an open-vocabulary semantic segmentation method, dubbed CLIP-DIY, which requires no additional training or annotations and leverages existing unsupervised object localization approaches. In particular, CLIP-DIY is a multi-scale approach that directly exploits CLIP's classification abilities on patches of different sizes and aggregates the decisions into a single map. We further guide the segmentation with foreground/background scores obtained from unsupervised object localization methods. With our method, we obtain state-of-the-art zero-shot semantic segmentation results on PASCAL VOC and perform on par with the best methods on COCO. The code is available at http://github.com/wysoczanska/clip-diy
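
The multi-scale aggregation described above can be illustrated with a minimal sketch. It is not the released CLIP-DIY implementation: the function name, the `classify_patch` callable (a stand-in for CLIP zero-shot classification of one patch), the regular grid of patches per scale, the averaging rule, and the thresholding of the foreground score are all assumptions made for illustration.

```python
import numpy as np

def multiscale_patch_segmentation(image, classify_patch, num_classes,
                                  scales=(1, 2, 4), saliency=None,
                                  fg_threshold=0.5):
    """Aggregate patch-level class scores from several scales into one dense map.

    image:          H x W x C array.
    classify_patch: hypothetical callable mapping a patch to a length-num_classes
                    score vector (stand-in for a CLIP zero-shot classifier).
    saliency:       optional H x W foreground scores in [0, 1] (stand-in for an
                    unsupervised object localization output).
    """
    h, w = image.shape[:2]
    score_sum = np.zeros((h, w, num_classes))
    for s in scales:
        # Split the image into an s x s grid of patches (assumed patching scheme).
        ys = np.linspace(0, h, s + 1).astype(int)
        xs = np.linspace(0, w, s + 1).astype(int)
        for y0, y1 in zip(ys[:-1], ys[1:]):
            for x0, x1 in zip(xs[:-1], xs[1:]):
                if y1 <= y0 or x1 <= x0:
                    continue  # skip empty patches when s exceeds the image size
                probs = np.asarray(classify_patch(image[y0:y1, x0:x1]))
                # Every pixel in the patch inherits the patch-level scores.
                score_sum[y0:y1, x0:x1] += probs
    scores = score_sum / len(scales)  # assumed aggregation: mean over scales
    labels = scores.argmax(axis=-1)
    if saliency is not None:
        # Assumed guidance rule: low-saliency pixels become background (-1).
        labels = np.where(saliency >= fg_threshold, labels, -1)
    return labels, scores


# Toy usage: a 4 x 4 image whose right half is bright, and a toy two-class
# "classifier" that scores a patch by its mean intensity.
image = np.zeros((4, 4, 1))
image[:, 2:] = 1.0
toy_classifier = lambda patch: np.array([1.0 - patch.mean(), patch.mean()])
labels, scores = multiscale_patch_segmentation(image, toy_classifier, 2, scales=(1, 2))
```

In this toy case the coarse scale cannot separate the two halves, while the finer scale can, so the averaged map assigns class 0 to the left half and class 1 to the right half. Passing a `saliency` array additionally marks low-scoring pixels as background.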