Dual encoder architectures like CLIP models map two types of inputs into a shared embedding space and predict similarities between them. Despite their success, it is not understood how these models compare their two inputs. Common first-order feature-attribution methods can only provide limited insights into dual encoders, since their predictions depend on feature interactions rather than on individual features. In this paper, we first derive a second-order method enabling the attribution of predictions by any differentiable dual encoder onto feature interactions between its inputs. Second, we apply our method to CLIP models and show that they learn fine-grained correspondences between parts of captions and regions in images. They match objects across input modes and also account for mismatches. This visual-linguistic grounding ability, however, varies heavily between object classes and exhibits pronounced out-of-domain effects. We can identify individual errors as well as systematic failure categories, including object coverage, unusual scenes, and correlated contexts.
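To make the second-order idea concrete, below is a minimal sketch of one way to attribute a dual encoder's similarity score onto feature interactions: the cross-block of mixed second derivatives ∂²s/∂x_i∂y_j, whose entry (i, j) scores the interaction between feature i of one input and feature j of the other. The toy linear encoders, the cosine-similarity head, and the use of the raw mixed Hessian as the interaction score are illustrative assumptions, not the paper's exact derivation.

```python
import jax
import jax.numpy as jnp

# Toy dual encoder: two independent linear maps into a shared space.
# W_img and W_txt are illustrative stand-ins, not the paper's architecture.
key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
W_img = jax.random.normal(k1, (8, 16))   # image features -> shared space
W_txt = jax.random.normal(k2, (8, 16))   # text features  -> shared space

def similarity(x_img, x_txt):
    """Cosine similarity between the two encoded inputs."""
    z_i = W_img @ x_img
    z_t = W_txt @ x_txt
    return jnp.dot(z_i, z_t) / (jnp.linalg.norm(z_i) * jnp.linalg.norm(z_t))

# Mixed second derivatives d^2 s / (dx_img_i dx_txt_j): entry (i, j)
# attributes the score to the interaction of image feature i with
# text feature j. Any differentiable dual encoder could replace
# `similarity` here.
interaction_fn = jax.jacfwd(jax.grad(similarity, argnums=0), argnums=1)

x_img = jax.random.normal(k3, (16,))
x_txt = jax.random.normal(k4, (16,))
interactions = interaction_fn(x_img, x_txt)  # shape (16, 16)
print(interactions.shape)
```

Note that purely first-order attributions (the gradient with respect to either input alone) cannot expose this (i, j) structure, which is the abstract's motivation for a second-order method.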