Contrastive language-image pre-training (CLIP) is a powerful vision-language model that has proven beneficial across a wide range of tasks. However, we identify issues with its explainability that undermine its credibility and limit its capacity on related tasks. Specifically, we find that CLIP tends to focus on background regions rather than the foreground, and that its visualizations show noisy activations at irrelevant positions. These phenomena conflict with conventional explainability methods based on the class attention map (CAM), in which the raw model can highlight local foreground regions using only global supervision, without alignment. To address these problems, we take a closer look at CLIP's architecture and features. Our analysis shows that the raw self-attentions link to inconsistent semantic regions, which produces the opposite visualization, and that the noisy activations stem from redundant features among categories. Building on these insights, we propose CLIP Surgery for reliable CAM, a method that makes surgery-like modifications to the inference architecture and features, with no further fine-tuning, as in classical CAM methods. This approach significantly improves the explainability of CLIP, surpassing existing methods by large margins. It also enables multimodal visualization and extends the capacity of raw CLIP on open-vocabulary tasks without extra alignment. The code is available at https://github.com/xmed-lab/CLIP_Surgery.
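To make the idea of noisy activations caused by redundant features among categories more concrete, the following is a minimal illustrative sketch, not the implementation in the linked repository. It assumes that per-patch image features and per-class text features are already available as plain arrays, and it models the redundant part as the component of the element-wise image-text product that is common to all classes. The function names `similarity_maps`, `similarity_maps_without_shared_part`, and `to_heatmap` are hypothetical, and random arrays stand in for real CLIP features.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def similarity_maps(patch_feats, text_feats):
    """Raw per-patch cosine similarity.

    patch_feats: (N, C) array, one feature per image patch (assumed input).
    text_feats:  (K, C) array, one feature per class prompt (assumed input).
    Returns an (N, K) array of similarities.
    """
    return l2_normalize(patch_feats) @ l2_normalize(text_feats).T


def similarity_maps_without_shared_part(patch_feats, text_feats):
    """Similarity after removing the component common to all classes.

    Assumption for illustration: the redundant feature is the mean over
    classes of the element-wise image-text product.
    """
    p = l2_normalize(patch_feats)              # (N, C)
    t = l2_normalize(text_feats)               # (K, C)
    prod = p[:, None, :] * t[None, :, :]       # (N, K, C); summing over C gives cosine similarity
    shared = prod.mean(axis=1, keepdims=True)  # (N, 1, C); part shared by every class
    return (prod - shared).sum(axis=-1)        # (N, K)


def to_heatmap(sim_column, grid_hw):
    """Reshape one class column to the patch grid and min-max scale it to [0, 1]."""
    h, w = grid_hw
    m = sim_column.reshape(h, w)
    return (m - m.min()) / (m.max() - m.min() + 1e-12)


# Random stand-ins for real features: a 7x7 patch grid, 3 classes, 16 channels.
rng = np.random.default_rng(0)
patches = rng.normal(size=(49, 16))
texts = rng.normal(size=(3, 16))

raw = similarity_maps(patches, texts)
cleaned = similarity_maps_without_shared_part(patches, texts)
heatmap = to_heatmap(cleaned[:, 0], (7, 7))
print(raw.shape, cleaned.shape, heatmap.shape)
```

In this simplified form, removing the shared component is equivalent to subtracting, for each patch, the mean similarity over all classes, so activations that respond to every class equally are suppressed. The released code may weight classes differently and also modifies the attention layers, which this sketch does not attempt to reproduce.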