Contrastive Language-Image Pre-Training (CLIP) is a popular method for learning multimodal latent spaces with well-organized semantics. Despite its wide range of applications, CLIP's latent space is known to fail at handling complex visual-textual interactions. Recent works attempt to address its shortcomings with data-centric or algorithmic approaches. But what if the problem is more fundamental, and lies in the geometry of CLIP? To this end, we rigorously analyze CLIP's latent space properties and prove that no CLIP-like joint embedding space exists that can simultaneously and correctly achieve any two of the following: 1. represent basic descriptions and image content, 2. represent attribute binding, 3. represent spatial location and relationships, 4. represent negation. Informed by this analysis, we propose Dense Cosine Similarity Maps (DCSMs) as a principled and interpretable scoring method for CLIP-like models, which addresses these fundamental limitations of CLIP by retaining the semantic topology of the image patches and text tokens. This method improves upon the performance of classical CLIP-like joint encoder models on a wide array of benchmarks. We share our code and data here for reproducibility: https://github.com/Raphoo/DCSM_Ideal_CLIP
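To make the core idea concrete, the following is a minimal sketch of computing a dense cosine similarity map between image patch embeddings and text token embeddings. The function name, array shapes, and the absence of any learned scoring head are illustrative assumptions, not the paper's actual implementation: the point is only that, unlike standard CLIP scoring, no pooling collapses the patches and tokens into single vectors before comparison.

```python
import numpy as np

def dense_cosine_similarity_map(patch_embeds: np.ndarray,
                                token_embeds: np.ndarray) -> np.ndarray:
    """Cosine similarity between every image patch and every text token.

    patch_embeds: (num_patches, dim) per-patch embeddings (assumed shape)
    token_embeds: (num_tokens, dim) per-token embeddings (assumed shape)
    Returns a (num_patches, num_tokens) map; standard CLIP would instead
    reduce each side to one pooled vector and return a single scalar.
    """
    # L2-normalize each row so the dot product equals cosine similarity
    p = patch_embeds / np.linalg.norm(patch_embeds, axis=-1, keepdims=True)
    t = token_embeds / np.linalg.norm(token_embeds, axis=-1, keepdims=True)
    return p @ t.T
```

A downstream scoring method can then operate on this full map, preserving which patch matches which token, rather than on a single pooled similarity score.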