CLIP for Lightweight Semantic Segmentation

The large-scale pretrained model CLIP, trained on 400 million image-text pairs, offers a promising paradigm for tackling vision tasks, albeit at the image level. Later works such as DenseCLIP and LSeg extend this paradigm to dense prediction, including semantic segmentation, and achieve excellent results. However, these methods either rely on CLIP-pretrained visual backbones or use non-pretrained but heavy backbones such as Swin, and they become ineffective when applied to lightweight backbones. The reason is that lightweight networks, whose feature extraction ability is relatively limited, have difficulty producing image features that align well with text embeddings. In this work, we present a new feature fusion module that tackles this problem and enables the language-guided paradigm to be applied to lightweight networks. Specifically, the module is a parallel design of a CNN and a transformer with a two-way bridge in between, where the CNN extracts spatial information and visual context from the feature map produced by the image encoder, and the transformer propagates the text embeddings from the text encoder forward. The core of the module is the bidirectional fusion of visual and text features across the bridge, which promotes their proximity and alignment in the embedding space. The module is model-agnostic: it not only makes language-guided lightweight semantic segmentation practical, but also fully exploits the pretrained knowledge of language priors and achieves better performance than previous SOTA work, such as DenseCLIP, regardless of the vision backbone. Extensive experiments demonstrate the superiority of our method.
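
To make the two-way bridge concrete, the following is a minimal NumPy sketch of one bidirectional fusion step followed by pixel-text matching. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the fusion operator, so single-head cross-attention with residual connections is assumed here, and the CNN and transformer blocks of the two branches are omitted. All names (`cross_attend`, `two_way_bridge`, `pixel_text_scores`, `params`) and all tensor sizes are hypothetical, and the weights are random rather than learned.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: each query row gathers from context rows."""
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v


def two_way_bridge(visual, text, params):
    """One bidirectional fusion step (assumed form, with residuals).

    visual: (H*W, D) flattened feature map from the image encoder.
    text:   (K, D) class embeddings from the text encoder.
    Both directions read the un-fused inputs, mirroring a parallel design.
    """
    # Text -> visual: every pixel queries the class embeddings.
    visual_out = visual + cross_attend(visual, text, *params["t2v"])
    # Visual -> text: every class embedding queries the pixel features.
    text_out = text + cross_attend(text, visual, *params["v2t"])
    return visual_out, text_out


def pixel_text_scores(visual, text, height, width):
    """Cosine similarity between each pixel and each class, shape (K, H, W)."""
    v = visual / np.linalg.norm(visual, axis=-1, keepdims=True)
    t = text / np.linalg.norm(text, axis=-1, keepdims=True)
    return (v @ t.T).T.reshape(text.shape[0], height, width)


# Toy sizes, chosen only for illustration.
H, W, D, K = 4, 4, 8, 3
rng = np.random.default_rng(0)
visual = rng.standard_normal((H * W, D))
text = rng.standard_normal((K, D))
# Random projection weights (query, key, value) for each bridge direction.
params = {
    name: [0.1 * rng.standard_normal((D, D)) for _ in range(3)]
    for name in ("t2v", "v2t")
}

fused_visual, fused_text = two_way_bridge(visual, text, params)
scores = pixel_text_scores(fused_visual, fused_text, H, W)
labels = scores.argmax(axis=0)  # per-pixel class index, shape (H, W)
```

Here `scores` plays the role of a pixel-text score map: the predicted class at each location is the text embedding most similar to that pixel's fused feature. In a trained model the bridge weights would be learned so that the fused visual and text features move closer in the shared embedding space.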
