Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RANKCLIP, a novel pretraining method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By extending the traditional pair-wise loss to a list-wise one, and leveraging both in-modal and cross-modal ranking consistency, RANKCLIP improves the alignment process, enabling the model to capture the nuanced many-to-many relationships between and within each modality. Through comprehensive experiments, we demonstrate the effectiveness of RANKCLIP in various downstream tasks, notably achieving significant gains in zero-shot classification over state-of-the-art methods, underscoring the importance of this enhanced learning process.
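To make the list-wise idea concrete, below is a minimal, hypothetical PyTorch sketch of how a ranking-consistency term could be combined with the standard CLIP contrastive loss. The function names, the weighting factor `lam`, and the ListNet-style KL surrogate over row-wise softmax distributions are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(logits):
    # Standard symmetric InfoNCE over the image-text logit matrix.
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def listwise_consistency(scores_a, scores_b, tau=1.0):
    # Treat each row of scores as a ranking over the batch and align the two
    # rankings via KL divergence between their softmax distributions
    # (a ListNet-style surrogate for ranking consistency; an assumption here).
    log_p = F.log_softmax(scores_a / tau, dim=-1)
    q = F.softmax(scores_b / tau, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean")


def rank_consistent_loss(img_emb, txt_emb, logit_scale, lam=0.5):
    # Hypothetical combined objective: CLIP loss + list-wise consistency terms.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits_it = logit_scale * img_emb @ txt_emb.t()   # cross-modal similarities
    sim_ii = img_emb @ img_emb.t()                    # in-modal (image-image)
    sim_tt = txt_emb @ txt_emb.t()                    # in-modal (text-text)
    # Cross-modal consistency: image-to-text rankings agree with text-to-image rankings.
    cross = listwise_consistency(logits_it, logits_it.t())
    # In-modal consistency: cross-modal rankings agree with in-modal similarity rankings.
    in_modal = listwise_consistency(logits_it, sim_tt) + listwise_consistency(logits_it.t(), sim_ii)
    return clip_contrastive_loss(logits_it) + lam * (cross + in_modal)
```

Under these assumptions, the consistency terms reward batches in which the ordering of similarities, rather than only the matched pair, agrees across and within modalities, which is one way to realize the many-to-many alignment described above.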