The recent success of large-scale Contrastive Language-Image Pre-training (CLIP) has shown great promise for zero-shot semantic segmentation by transferring image-text aligned knowledge to pixel-level classification. However, existing methods usually require an additional image encoder or retraining/tuning of the CLIP module. Here, we propose Zero-shot segmentation with Optimal Transport (ZegOT), a novel method that matches multiple text prompts with frozen image embeddings through optimal transport. In particular, we introduce a novel Multiple Prompt Optimal Transport Solver (MPOT), which is designed to learn an optimal mapping between multiple text prompts and the visual feature maps from the hidden layers of the frozen image encoder. This mapping enables each of the multiple text prompts to focus on distinct visual semantic attributes. Through extensive experiments on benchmark datasets, we show that our method achieves state-of-the-art (SOTA) performance over existing Zero-shot Semantic Segmentation (ZS3) approaches.
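To make the idea of matching several text prompts to pixel features through optimal transport more concrete, the following is a minimal sketch of entropy-regularized optimal transport solved with generic Sinkhorn iterations. It is not the MPOT solver itself, whose details are not given here. The uniform marginals, the cosine-distance cost, the regularization strength `eps`, the per-pixel score, and the function names `sinkhorn_plan` and `prompt_pixel_scores` are all assumptions made for illustration.

```python
import numpy as np


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropy-regularized OT plan via Sinkhorn iterations.

    Assumption: both marginals are uniform (equal mass per prompt and per pixel).
    """
    n_prompts, n_pixels = cost.shape
    mu = np.full(n_prompts, 1.0 / n_prompts)  # mass assigned to each text prompt
    nu = np.full(n_pixels, 1.0 / n_pixels)    # mass assigned to each pixel
    kernel = np.exp(-cost / eps)              # Gibbs kernel of the cost matrix
    u = np.ones(n_prompts)
    v = np.ones(n_pixels)
    for _ in range(n_iters):
        u = mu / (kernel @ v)                 # rescale rows toward mu
        v = nu / (kernel.T @ u)               # rescale columns toward nu
    return u[:, None] * kernel * v[None, :]   # plan, shape (n_prompts, n_pixels)


def prompt_pixel_scores(text_emb, pixel_emb, eps=0.1):
    """Hypothetical helper: transport-weighted similarity per pixel for one class."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    p = pixel_emb / np.linalg.norm(pixel_emb, axis=1, keepdims=True)
    sim = t @ p.T                             # cosine similarity, (n_prompts, n_pixels)
    plan = sinkhorn_plan(1.0 - sim, eps=eps)  # assumed cost: cosine distance
    return plan, (plan * sim).sum(axis=0)     # one score per pixel


# Toy usage with random stand-ins for frozen text and pixel embeddings.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))    # 4 prompts for one class, 8-dim embeddings
pixel_emb = rng.normal(size=(16, 8))  # 16 pixels (flattened feature map)
plan, scores = prompt_pixel_scores(text_emb, pixel_emb)
```

Because the row marginal forces every prompt to carry the same total mass, no single prompt can absorb all pixels, which is one plausible way to read the claim that each prompt attends to distinct visual attributes.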