Multi-modal contrastive models such as CLIP achieve state-of-the-art performance in zero-shot classification by embedding input images and texts in a joint representational space. Recently, a modality gap has been reported in two-encoder contrastive models like CLIP, meaning that the image and text embeddings reside in disjoint regions of the latent space. Previous studies attribute this gap to 1) the cone effect, 2) mismatched pairs in the dataset, and 3) insufficient training. We show that, even after accounting for all these factors, and even when both encoders process the same modality, the contrastive loss still creates a gap during training. We therefore propose that the gap is inherent to the two-encoder contrastive loss and rename it the contrastive gap. We present evidence attributing this contrastive gap to low uniformity in CLIP space, which leaves the embeddings occupying only a small portion of the latent space. To close the gap, we adapt the uniformity and alignment properties of unimodal contrastive loss to the multi-modal setting and show that simply adding these terms to the CLIP loss distributes the embeddings more uniformly in the representational space, closing the gap. In our experiments, the modified representational space outperforms the default CLIP loss on downstream tasks such as zero-shot image classification and multi-modal arithmetic.
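To make the proposed modification concrete, the sketch below shows the alignment and uniformity terms (in the form introduced by Wang & Isola, 2020, for the unimodal case) added to a standard symmetric CLIP loss. This is a minimal illustration under assumed conventions: unit-normalized embeddings, a single weighting factor `lam`, and uniformity applied per modality; the exact weighting and combination used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def alignment_loss(img_emb, txt_emb, alpha=2):
    # Alignment: mean distance between matched image/text pairs
    # (Wang & Isola, 2020). Inputs are (N, D) unit-normalized embeddings.
    return (img_emb - txt_emb).norm(p=2, dim=1).pow(alpha).mean()

def uniformity_loss(emb, t=2):
    # Uniformity: log of the average Gaussian potential over all pairs,
    # which is minimized when embeddings spread evenly over the hypersphere.
    sq_dists = torch.pdist(emb, p=2).pow(2)
    return sq_dists.mul(-t).exp().mean().log()

def clip_loss_with_uniformity(img_emb, txt_emb, logit_scale, lam=1.0):
    # Standard symmetric CLIP (InfoNCE) loss over a batch of matched pairs.
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    logits = logit_scale * img_emb @ txt_emb.t()
    labels = torch.arange(len(img_emb), device=img_emb.device)
    clip_loss = 0.5 * (F.cross_entropy(logits, labels)
                       + F.cross_entropy(logits.t(), labels))
    # Added terms: pull matched pairs together, spread each modality out.
    extra = alignment_loss(img_emb, txt_emb) \
        + 0.5 * (uniformity_loss(img_emb) + uniformity_loss(txt_emb))
    return clip_loss + lam * extra  # lam is a hypothetical trade-off weight
```

Intuitively, the uniformity terms counteract the low-uniformity collapse described above by rewarding embeddings that occupy more of the hypersphere, while the alignment term keeps matched image/text pairs close, which together reduce the contrastive gap.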