Aligned text-image encoders such as CLIP have become the de facto model for vision-language tasks. Furthermore, modality-specific encoders achieve impressive performance in their respective domains. This raises a central question: does an alignment exist between uni-modal vision and language encoders, since they fundamentally represent the same physical world? Analyzing the latent space structure of vision and language models on image-caption benchmarks using Centered Kernel Alignment (CKA), we find that the representation spaces of unaligned and aligned encoders are semantically similar. In the absence of statistical similarity in aligned encoders like CLIP, we show that a possible matching of unaligned encoders exists without any training. We frame this as a seeded graph-matching problem exploiting the semantic similarity between graphs and propose two methods: a Fast Quadratic Assignment Problem optimization and a novel localized CKA metric-based matching/retrieval. We demonstrate the effectiveness of this approach on several downstream tasks, including cross-lingual, cross-domain caption matching and image classification.
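To make the CKA comparison concrete, the following is a minimal sketch of the linear-kernel variant of CKA, computed on two sets of paired embeddings (for example, image features and caption features for the same N image-caption pairs). The choice of a linear kernel, the function name `linear_cka`, and the synthetic data are assumptions made for illustration; the abstract does not specify the kernel or the implementation used.

```python
import numpy as np


def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear-kernel CKA between paired representations.

    X has shape (N, d1) and Y has shape (N, d2); row i of X and row i of Y
    are assumed to describe the same sample (e.g. an image and its caption).
    """
    # Center each feature dimension so the implied Gram matrices are centered.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)

    # With linear kernels, HSIC(K, L) is proportional to ||Y^T X||_F^2;
    # the proportionality constant cancels in the normalized ratio.
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    self_x = np.linalg.norm(X.T @ X, ord="fro") ** 2
    self_y = np.linalg.norm(Y.T @ Y, ord="fro") ** 2
    return float(cross / np.sqrt(self_x * self_y))


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Hypothetical stand-ins for encoder outputs: Y is a rotated, rescaled
    # copy of X, so the two spaces share the same structure.
    X = rng.normal(size=(200, 16))
    Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
    Y = 3.0 * X @ Q

    print("paired rows:  ", linear_cka(X, Y))
    # Shuffling the rows of Y breaks the pairing and lowers the score.
    print("shuffled rows:", linear_cka(X, Y[rng.permutation(len(Y))]))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either space, so it compares the structure of the two spaces rather than their coordinates. Because the score depends on which rows are paired, it can in principle serve as an objective when searching for a matching between unpaired images and captions.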