CLIP has recently shown promising results in zero-shot semantic segmentation by transferring multimodal knowledge to pixel-level classification. However, existing approaches remain limited in how closely they can align text embeddings with pixel embeddings using pre-trained CLIP knowledge. To address this issue, we propose OTSeg, a novel multimodal attention mechanism that enhances the ability of multiple text prompts to match their associated pixel embeddings. We first propose Multi-Prompts Sinkhorn (MPS), based on the Optimal Transport (OT) algorithm, which leads multiple text prompts to selectively focus on different semantic features within image pixels. Moreover, inspired by the success of Sinkformers in unimodal settings, we introduce an extension of MPS, called Multi-Prompts Sinkhorn Attention (MPSA), which effectively replaces the cross-attention mechanism within the Transformer framework in multimodal settings. Through extensive experiments, we demonstrate that OTSeg achieves state-of-the-art (SOTA) performance with significant gains on Zero-Shot Semantic Segmentation (ZS3) tasks across three benchmark datasets.
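To make the optimal-transport idea behind MPS more concrete, the following is a minimal sketch of generic entropy-regularised Sinkhorn iterations between a few text-prompt embeddings and a set of pixel embeddings. It is not the OTSeg implementation: the cosine cost, the uniform marginals, the regularisation strength `eps`, the iteration count, and the random stand-in embeddings are all assumptions chosen for illustration.

```python
import numpy as np


def cosine_cost(prompts, pixels):
    """Cost matrix 1 - cosine similarity, shape (n_prompts, n_pixels)."""
    p = prompts / np.linalg.norm(prompts, axis=1, keepdims=True)
    x = pixels / np.linalg.norm(pixels, axis=1, keepdims=True)
    return 1.0 - p @ x.T


def sinkhorn(cost, eps=0.5, n_iters=200):
    """Entropy-regularised OT plan between uniform marginals (assumed)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # mass assigned to each prompt (assumption: uniform)
    b = np.full(m, 1.0 / m)   # mass assigned to each pixel (assumption: uniform)
    K = np.exp(-cost / eps)   # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)       # rescale rows to match the prompt marginal
        v = b / (K.T @ u)     # rescale columns to match the pixel marginal
    return u[:, None] * K * v[None, :]


# Hypothetical stand-ins for CLIP text and pixel embeddings.
rng = np.random.default_rng(0)
prompts = rng.normal(size=(4, 16))    # 4 text prompts, 16-dim
pixels = rng.normal(size=(32, 16))    # 32 pixel embeddings, 16-dim

plan = sinkhorn(cosine_cost(prompts, pixels))
```

Each row of `plan` describes how one prompt spreads its mass over the pixels. Because every prompt must carry the same total mass, the prompts are pushed to cover different pixels rather than collapsing onto the same ones, which is the intuition behind having multiple prompts attend to different semantic features.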