Contrastive Language--Image Pre-training (CLIP) has demonstrated remarkable improvements in zero-shot classification and cross-modal vision-language tasks. Yet, from a geometric point of view, the CLIP embedding space has been found to exhibit a pronounced modality gap. This gap renders the embedding space overly sparse and disconnected, with the two modalities densely distributed in distinct subregions of the hypersphere. In this work, we aim to answer two main questions: 1. Does sharing the parameter space between the multi-modal encoders reduce the modality gap? 2. Can the gap be mitigated by pushing apart the uni-modal embeddings via intra-modality separation? We design AlignCLIP to answer these questions and show that the answer to both is positive. Through extensive experiments, we show that AlignCLIP achieves noticeable enhancements in the cross-modal alignment of the embeddings and thereby reduces the modality gap, while maintaining performance across several downstream evaluations, such as zero-shot image classification, zero-shot multi-modal retrieval, and zero-shot semantic text similarity.
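To make the second question concrete, the following is a minimal sketch (not the authors' implementation) of how an intra-modality separation term could be combined with the standard CLIP contrastive objective: the cross-modal loss pulls matching image-text pairs together, while a hypothetical separation penalty pushes distinct embeddings of the same modality apart on the hypersphere. The function names `intra_modality_separation`, `total_loss`, and the weight `lambda_sep` are illustrative assumptions, not part of AlignCLIP as published.

```python
# Hedged sketch of a CLIP-style loss with an added intra-modality separation
# term. All helper names and the weighting scheme are hypothetical.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Standard symmetric InfoNCE loss over L2-normalized embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def intra_modality_separation(emb, temperature=0.07):
    """Hypothetical separation term: the mean pairwise similarity between
    distinct embeddings of the same modality. Minimizing it pushes the
    uni-modal embeddings apart, spreading them over the hypersphere."""
    emb = F.normalize(emb, dim=-1)
    sim = emb @ emb.t() / temperature
    off_diag = sim - torch.diag(torch.diag(sim))  # zero out self-similarity
    n = emb.size(0)
    return off_diag.sum() / (n * (n - 1))

def total_loss(img_emb, txt_emb, lambda_sep=0.1):
    """Cross-modal contrastive loss plus intra-modality separation terms."""
    sep = intra_modality_separation(img_emb) + intra_modality_separation(txt_emb)
    return clip_contrastive_loss(img_emb, txt_emb) + lambda_sep * sep
```

The first question (parameter sharing) would correspond to encoding both modalities with a single shared transformer so that `img_emb` and `txt_emb` come from the same parameter space; that architectural choice is orthogonal to the loss sketched above.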