Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP

Open-vocabulary segmentation is a challenging task that requires segmenting and recognizing objects from an open set of categories. One way to address this challenge is to leverage multi-modal models, such as CLIP, which provide image and text features in a shared embedding space and thereby bridge the gap between closed-vocabulary and open-vocabulary recognition. Existing methods therefore often adopt a two-stage framework, in which the inputs first go through a mask generator and then through the CLIP model along with the predicted masks. This process extracts features from the image multiple times, which can be ineffective and inefficient. By contrast, we propose to build everything into a single-stage framework using a shared Frozen Convolutional CLIP backbone, which not only significantly simplifies the current two-stage pipeline but also yields a remarkably better accuracy-cost trade-off. The proposed FC-CLIP benefits from the following observations: the frozen CLIP backbone maintains the ability of open-vocabulary classification and can also serve as a strong mask generator, and the convolutional CLIP generalizes well to input resolutions larger than the one used during contrastive image-text pretraining. When trained on COCO panoptic data only and tested in a zero-shot manner, FC-CLIP achieves 26.8 PQ, 16.8 AP, and 34.1 mIoU on ADE20K; 18.2 PQ and 27.9 mIoU on Mapillary Vistas; and 44.0 PQ, 26.8 AP, and 56.2 mIoU on Cityscapes, outperforming the prior art by +4.2 PQ, +2.4 AP, and +4.2 mIoU on ADE20K, +4.0 PQ on Mapillary Vistas, and +20.1 PQ on Cityscapes. Additionally, FC-CLIP trains 7.5x faster and tests 6.6x faster than the same prior art while using 5.9x fewer parameters. FC-CLIP also sets a new state of the art across various open-vocabulary semantic segmentation datasets. Code is available at https://github.com/bytedance/fc-clip
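
To make the shared-embedding-space idea concrete, the toy sketch below shows one plausible way a single feature map from a frozen backbone could be reused to classify predicted masks against text embeddings. It is an illustration under stated assumptions, not the released implementation: the abstract does not spell out the pooling step, so mask-averaged pooling followed by cosine similarity is assumed here, and `mask_pool`, `cosine`, `classify_masks`, and all of the toy data are hypothetical names and values.

```python
import math


def mask_pool(features, mask):
    """Average the feature vectors at pixels where the binary mask is 1.

    Assumption: mask-averaged pooling over one shared feature map.
    """
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, keep in zip(feat_row, mask_row):
            if keep:
                count += 1
                for i, value in enumerate(vec):
                    total[i] += value
    if count == 0:
        raise ValueError("mask selects no pixels")
    return [t / count for t in total]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_masks(features, masks, text_embeddings):
    """Label each mask with the category whose text embedding is closest."""
    labels = []
    for mask in masks:
        pooled = mask_pool(features, mask)
        best = max(
            text_embeddings,
            key=lambda name: cosine(pooled, text_embeddings[name]),
        )
        labels.append(best)
    return labels


# Hypothetical 2x2 feature map with 2-dimensional features (H x W x D).
features = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.0, 1.0], [0.1, 0.9]],
]
# Two hypothetical predicted masks: top row and bottom row.
masks = [
    [[1, 1], [0, 0]],
    [[0, 0], [1, 1]],
]
# Hypothetical text embeddings; the category list can be changed at test time.
text_embeddings = {"cat": [1.0, 0.0], "grass": [0.0, 1.0]}

labels = classify_masks(features, masks, text_embeddings)
print(labels)  # ['cat', 'grass']
```

The point of the sketch is that the feature map is computed once and reused for every mask, whereas a two-stage pipeline would run the image through a second model for each masked region. Because categories enter only through `text_embeddings`, the label set can be swapped without retraining anything in this toy setup.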
