Recently, the zero-shot semantic segmentation problem has attracted increasing attention, and the best performing methods are based on two-stream networks: one stream for proposal mask generation and the other for segment classification using a pre-trained visual-language model. However, existing two-stream methods require passing a great number of (up to a hundred) image crops into the visual-language model, which is highly inefficient. To address the problem, we propose a network that only needs a single pass through the visual-language model for each input image. Specifically, we first propose a novel network adaptation approach, termed patch severance, to restrict the harmful interference between the patch embeddings in the pre-trained visual encoder. We then propose classification anchor learning to encourage the network to spatially focus on more discriminative features for classification. Extensive experiments demonstrate that the proposed method achieves outstanding performance, surpassing state-of-the-art methods while being 4 to 7 times faster at inference. We release our code at https://github.com/CongHan0808/DeOP.git.
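To make the single-pass idea concrete, the following is a minimal sketch and not the released DeOP implementation. It assumes the visual encoder has already produced one embedding per image patch in a single forward pass, that the proposal stream has produced soft masks over those patches, and that the text encoder has produced one embedding per class name. Each proposal is then classified by mask-weighted pooling of the shared patch embeddings instead of by re-encoding an image crop. The function name, argument names, and temperature value are hypothetical, and patch severance and classification anchor learning are not modeled here.

```python
import numpy as np


def classify_segments_single_pass(patch_emb, masks, text_emb, temperature=0.01):
    """Classify K mask proposals from one set of patch embeddings.

    patch_emb: (N, D) patch embeddings from ONE encoder pass (assumed given).
    masks:     (K, N) soft proposal masks over the N patches, values in [0, 1].
    text_emb:  (C, D) text embeddings, one per class name (assumed given).
    Returns a (K, C) array of class probabilities per proposal.
    """
    eps = 1e-8
    # Mask-weighted average pooling: one pooled embedding per proposal.
    weights = masks / np.clip(masks.sum(axis=1, keepdims=True), eps, None)
    seg_emb = weights @ patch_emb  # (K, D)

    # L2-normalise both sides so the dot product is a cosine similarity.
    seg_emb = seg_emb / np.clip(np.linalg.norm(seg_emb, axis=1, keepdims=True), eps, None)
    txt = text_emb / np.clip(np.linalg.norm(text_emb, axis=1, keepdims=True), eps, None)

    # Temperature-scaled softmax over classes (numerically stabilised).
    logits = seg_emb @ txt.T / temperature  # (K, C)
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


# Toy example: 4 patches, 2 proposals, 2 classes, embedding size 2.
toy_patches = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
toy_masks = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
toy_text = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_probs = classify_segments_single_pass(toy_patches, toy_masks, toy_text)
```

In this sketch the encoder cost does not depend on the number of proposals, whereas a crop-based two-stream pipeline runs the visual-language model once per crop; that is the efficiency argument made above.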