TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training

Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities in open-vocabulary classification. The class token in the image encoder is trained, under contrastive supervision, to capture global features that distinguish different text descriptions, which makes it highly effective for single-label classification. However, CLIP performs poorly on multi-label datasets because the global feature tends to be dominated by the most prominent class, and the contrastive nature of the softmax operation aggravates this. In this study, we observe that multi-label classification relies heavily on discriminative local features, which CLIP overlooks. We therefore dissect how patch-wise spatial information is preserved in CLIP and propose a local-to-global framework to obtain image tags. It comprises three steps: (1) patch-level classification to obtain coarse scores; (2) a dual-masking attention refinement (DMAR) module to refine the coarse scores; and (3) a class-wise reidentification (CWR) module to remedy predictions from a global perspective. The framework is based solely on frozen CLIP and significantly enhances its multi-label classification performance on various benchmarks without dataset-specific training. In addition, to comprehensively assess the quality and practicality of the generated tags, we apply them to a downstream task, weakly supervised semantic segmentation (WSSS), using the generated tags as image-level pseudo labels. Experiments demonstrate that this classify-then-segment paradigm dramatically outperforms other annotation-free segmentation methods, validating the effectiveness of the generated tags. Our code is available at https://github.com/linyq2117/TagCLIP.
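
The dominance problem and the motivation for step (1) can be illustrated with a toy NumPy sketch. It is not the authors' implementation: the feature vectors are synthetic, the global feature is assumed to be the mean of the patch features (CLIP's class token is not computed this way), the logit scale of 100 is an assumed stand-in for CLIP's temperature, and taking the maximum over patches is just one simple way to aggregate patch scores. The DMAR and CWR modules are not detailed in the abstract and are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def global_scores(global_feat, text_feats, scale=100.0):
    # Single-label style: softmax over classes from one global feature.
    logits = scale * l2_normalize(global_feat) @ l2_normalize(text_feats).T
    return softmax(logits)

def patch_level_scores(patch_feats, text_feats, scale=100.0):
    # Classify every patch, then keep the best patch score per class
    # (max aggregation is an assumption made for this sketch).
    logits = scale * l2_normalize(patch_feats) @ l2_normalize(text_feats).T
    per_patch = softmax(logits, axis=-1)   # shape: (num_patches, num_classes)
    return per_patch.max(axis=0)           # shape: (num_classes,)

# Synthetic setup: three classes with orthogonal text features.
text_feats = np.eye(3)
# Six patches show class 0 (the prominent object), two show class 1,
# and class 2 is absent from the image.
patch_feats = np.vstack([np.tile(text_feats[0], (6, 1)),
                         np.tile(text_feats[1], (2, 1))])
# Assumed global feature: the average of all patch features.
global_feat = patch_feats.mean(axis=0)

g = global_scores(global_feat, text_feats)
p = patch_level_scores(patch_feats, text_feats)
print("global:", np.round(g, 3))
print("patch-level:", np.round(p, 3))
```

In this toy setting the global softmax assigns almost all probability to class 0 and suppresses class 1, even though class 1 is present, while the patch-level scores are high for both present classes and low for the absent one.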
