CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks

Contrastive Language-Image Pre-training (CLIP) is a powerful multimodal vision model that has demonstrated significant benefits for downstream tasks, including many zero-shot learning and text-guided vision tasks. However, we observe severe problems with the model's explainability, which undermine its credibility and impede related tasks. Specifically, we find that, according to the predicted similarity map, CLIP prefers background regions over foreground regions, which contradicts human understanding. In addition, the visualization results contain obvious noisy activations at irrelevant positions. To address these two issues, we conduct in-depth analyses and reveal their causes with new findings and evidence. Based on these insights, we propose CLIP Surgery, a method that applies surgery-like modifications to the inference architecture and features, yielding better explainability and improved performance on multiple open-vocabulary tasks. The proposed method significantly improves the explainability of CLIP for both convolutional networks and vision transformers, surpassing existing methods by large margins. Our approach also brings remarkable improvements in open-vocabulary segmentation and multi-label recognition. For example, it improves mAP on NUS-Wide multi-label recognition by 4.41% without any additional training, and it surpasses the state-of-the-art method by 8.74% mIoU on Cityscapes open-vocabulary semantic segmentation. Furthermore, our method benefits other tasks, including multimodal visualization and interactive segmentation with models such as the Segment Anything Model (SAM). The code is available at https://github.com/xmed-lab/CLIP_Surgery

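To make the term "similarity map" concrete, the following is a minimal, generic sketch of how a text-to-patch similarity map can be computed from per-patch image features and a text feature. It is not the CLIP Surgery method and does not use the released code: the function names `similarity_map` and `min_max_normalize`, the 7x7 patch grid, the 512-dimensional features, and the random inputs are all illustrative assumptions. In practice the features would come from a CLIP image encoder and text encoder.

```python
import numpy as np


def similarity_map(patch_feats, text_feat, grid_hw):
    """Cosine similarity between each patch feature and one text feature.

    patch_feats: array of shape (h * w, d), one feature per image patch.
    text_feat:   array of shape (d,), the text embedding.
    grid_hw:     (h, w), the spatial layout of the patches.
    Names and shapes are hypothetical, chosen only for illustration.
    """
    h, w = grid_hw
    # L2-normalize both sides so the dot product equals cosine similarity.
    patches = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    text = text_feat / np.linalg.norm(text_feat)
    sim = patches @ text  # shape (h * w,)
    return sim.reshape(h, w)


def min_max_normalize(sim):
    """Rescale a map to [0, 1] so it can be rendered as a heatmap."""
    lo, hi = sim.min(), sim.max()
    return (sim - lo) / (hi - lo + 1e-8)


# Random stand-ins for encoder outputs (assumed sizes: 7x7 patches, 512 dims).
rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(7 * 7, 512))
text_feat = rng.normal(size=512)

sim = similarity_map(patch_feats, text_feat, (7, 7))
heat = min_max_normalize(sim)
```

A map in which background patches receive higher values than the object named in the text corresponds to the background-preference problem described in the abstract; scattered high values at unrelated patches correspond to the noisy activations.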