Contrastive Language-Image Pretraining (CLIP) performs zero-shot image classification by mapping images and textual class representations into a shared embedding space, then retrieving the class closest to the image. This work provides a new approach for interpreting CLIP models for image classification through the lens of mutual knowledge between the two modalities. Specifically, we ask: what concepts do both the vision and language CLIP encoders learn in common that influence the joint embedding space, causing points to be closer or further apart? We answer this question with an approach based on textual concept-based explanations, show its effectiveness, and perform an analysis encompassing a pool of 13 CLIP models varying in architecture, size, and pretraining dataset. We explore these different aspects in relation to mutual knowledge and analyze zero-shot predictions. Our approach demonstrates an effective and human-friendly way of understanding zero-shot classification decisions with CLIP.
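To make the zero-shot pipeline described above concrete, the following is a minimal sketch (not the paper's interpretability method) of CLIP zero-shot classification using the Hugging Face `transformers` API; the checkpoint name, class names, and image path are illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; any CLIP variant from the pool could be substituted.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Textual class representations: each class name is wrapped in a prompt template.
class_names = ["dog", "cat", "bird"]
prompts = [f"a photo of a {c}" for c in class_names]

image = Image.open("example.jpg")  # placeholder image path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled cosine similarities between the image
# embedding and each class-prompt embedding in the shared space; the
# zero-shot prediction is the class with the highest similarity.
probs = outputs.logits_per_image.softmax(dim=-1)
pred = class_names[probs.argmax(dim=-1).item()]
print(pred, probs.squeeze().tolist())
```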