Vision-language models (VLMs) demonstrate impressive capabilities in coarse-grained tasks like image classification and retrieval. However, they struggle with fine-grained tasks that require localized understanding. To investigate this weakness, we comprehensively analyze CLIP features and identify an important issue: semantic features are highly correlated. Specifically, the features of a class encode information about other classes, which we call mutual feature information (MFI). This mutual information becomes evident when we query a specific class and unrelated objects are activated along with the target class. To address this issue, we propose Unmix-CLIP, a novel framework designed to reduce MFI and improve feature disentanglement. We introduce an MFI loss, which explicitly separates text features by projecting them into a space where inter-class similarity is minimized. To ensure a corresponding separation in image features, we use multi-label recognition (MLR) to align the image features with the separated text features. This ensures that both image and text features are disentangled and aligned across modalities, improving feature separation for downstream tasks. On the COCO-14 dataset, Unmix-CLIP reduces feature similarity by 24.9%. We demonstrate its effectiveness through extensive evaluations on MLR and zero-shot semantic segmentation (ZS3). In MLR, our method performs competitively on VOC2007 and surpasses SOTA approaches on the COCO-14 dataset, using fewer training parameters. Additionally, Unmix-CLIP consistently outperforms existing ZS3 methods on COCO and VOC.
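The core idea of the MFI loss, as described above, is to penalize similarity between the projected text features of different classes. The following is a minimal, self-contained sketch of one plausible form of such a loss; the function names (`mfi_loss`, `project`) and the exact penalty (mean squared cosine similarity over distinct class pairs) are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of an MFI-style inter-class separation loss.
# Assumption: the loss penalizes pairwise cosine similarity between
# projected per-class text features; names here are hypothetical.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = sum(a * a for a in v) ** 0.5
    return [a / n for a in v]

def project(features, W):
    # Linear projection into the separation space:
    # W has shape (in_dim, out_dim); each feature row is mapped by W.
    return [[dot(f, col) for col in zip(*W)] for f in features]

def mfi_loss(features):
    """Mean squared cosine similarity over all distinct class pairs.

    Zero when all class features are mutually orthogonal (fully
    disentangled); positive when classes share feature information.
    """
    feats = [normalize(f) for f in features]
    n = len(feats)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += dot(feats[i], feats[j]) ** 2
            pairs += 1
    return total / pairs

# Orthogonal class features incur zero loss; correlated ones do not.
orthogonal = [[1.0, 0.0], [0.0, 1.0]]
correlated = [[1.0, 0.1], [0.1, 1.0]]
print(mfi_loss(orthogonal))  # 0.0
print(mfi_loss(correlated))  # positive
```

In a real training setup, `project` would be a learned layer and the loss would be minimized jointly with the MLR objective that aligns image features to the separated text features.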