We investigate the CLIP image encoder by analyzing how individual model components affect the final representation. We decompose the image representation as a sum across individual image patches, model layers, and attention heads, and use CLIP's text representation to interpret the summands. Interpreting the attention heads, we characterize each head's role by automatically finding text representations that span its output space, which reveals property-specific roles for many heads (e.g. location or shape). Next, interpreting the image patches, we uncover an emergent spatial localization within CLIP. Finally, we use this understanding to remove spurious features from CLIP and to create a strong zero-shot image segmenter. Our results indicate that a scalable understanding of transformer models is attainable and can be used to repair and improve models.
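To make the decomposition concrete, the following is a minimal numerical sketch (not the paper's code) of the linearity it relies on: because a ViT's residual stream is a sum of the CLS token's initial embedding and every attention-head and MLP write, projecting each term separately reproduces the projected sum. The shapes below follow a hypothetical ViT-B/16-style encoder (12 layers, 12 heads, width 768, joint space 512), and the layer normalization that precedes CLIP's projection is omitted for simplicity.

```python
import torch

# Toy check of the decomposition (a sketch, not the paper's code).
L, H, d, d_joint = 12, 12, 768, 512
torch.manual_seed(0)

cls_init = torch.randn(d, dtype=torch.float64)        # CLS token's initial embedding
head_out = torch.randn(L, H, d, dtype=torch.float64)  # each head's write to the CLS residual stream
mlp_out = torch.randn(L, d, dtype=torch.float64)      # each layer's MLP write
proj = torch.randn(d, d_joint, dtype=torch.float64)   # CLIP's projection to the joint space

# Standard forward view: sum the residual stream, then project once.
full = (cls_init + head_out.sum(dim=(0, 1)) + mlp_out.sum(dim=0)) @ proj

# Decomposed view: project every component, then sum the projections.
per_head = head_out @ proj   # (L, H, d_joint): one interpretable term per head
per_mlp = mlp_out @ proj     # (L, d_joint): one term per MLP layer
parts = cls_init @ proj + per_head.sum(dim=(0, 1)) + per_mlp.sum(dim=0)

assert torch.allclose(full, parts)  # linearity: the two views agree
```

Each per-head term lives in the joint text-image space, so it can be compared directly against text embeddings; that comparison underlies the head characterization.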
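The head-characterization step can likewise be sketched as a greedy procedure: repeatedly pick the candidate text embedding that explains the most remaining variance of a head's outputs, then project that direction out of both the outputs and the candidate pool. This is a simplified, hypothetical rendering of the idea rather than the authors' exact algorithm; `head_outs` and `text_embeds` are assumed to already lie in the joint text-image space.

```python
import torch

def greedy_text_span(head_outs, text_embeds, k=5):
    """Greedily pick k text directions that span a head's output space.

    head_outs:   (N, d) one head's contributions over N images.
    text_embeds: (M, d) candidate text embeddings in the joint space.
    Returns indices into the original candidate set.
    """
    residual = head_outs.clone()
    candidates = text_embeds.clone()
    chosen = []
    for _ in range(k):
        # Score each candidate by how much output variance it explains.
        scores = (residual @ candidates.T).pow(2).sum(dim=0)
        j = int(scores.argmax())
        chosen.append(j)
        direction = candidates[j] / candidates[j].norm().clamp_min(1e-8)
        # Remove the chosen direction so later picks explain the
        # remaining, orthogonal variance (already-picked texts score ~0).
        residual = residual - (residual @ direction).unsqueeze(1) * direction
        candidates = candidates - (candidates @ direction).unsqueeze(1) * direction
    return chosen
```

Reading off the texts at the returned indices then gives a human-interpretable basis for the head, e.g. a set of location or shape descriptions when the head encodes that property.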