Contrastive Language Image Pretraining (CLIP) has received widespread attention because its learned representations transfer well to various downstream tasks. During CLIP training, the InfoNCE objective aligns positive image-text pairs and separates negative ones. We show that this process has an underlying representation grouping effect: the InfoNCE objective indirectly groups semantically similar representations together via randomly emerging within-modal anchors. Based on this understanding, we introduce Prototypical Contrastive Language Image Pretraining (ProtoCLIP), which enhances such grouping by boosting its efficiency and increasing its robustness against the modality gap. Specifically, ProtoCLIP sets up prototype-level discrimination between the image and text spaces, which efficiently transfers higher-level structural knowledge. Further, we propose Prototypical Back Translation (PBT) to decouple representation grouping from representation alignment, resulting in effective learning of meaningful representations under a large modality gap. PBT also enables us to introduce additional external teachers with richer prior language knowledge. ProtoCLIP is trained with an online episodic training strategy, which allows it to scale to unlimited amounts of data. We train ProtoCLIP on Conceptual Captions and achieve a +5.81% ImageNet linear probing improvement and a +2.01% ImageNet zero-shot classification improvement. On the larger YFCC-15M dataset, ProtoCLIP matches the performance of CLIP with 33% of the training time. Code is available at https://github.com/megvii-research/protoclip.
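As a rough illustration of the InfoNCE objective mentioned above, the sketch below computes a symmetric image-to-text and text-to-image contrastive loss over a small batch in plain Python. It is not the ProtoCLIP implementation: the function names (`normalize`, `cross_entropy_row`, `info_nce`), the temperature of 0.07, and the toy Gaussian embeddings are all assumptions made for this example only.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target):
    """Negative log-softmax of logits[target], computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: the k-th image and k-th text form the positive pair,
    every other pairing in the batch is treated as a negative.
    The temperature value is an assumed default, not taken from the source."""
    imgs = [normalize(v) for v in image_emb]
    txts = [normalize(v) for v in text_emb]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text: each row should put its mass on the diagonal entry.
    i2t = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    # Text-to-image: same computation over the columns.
    columns = [list(col) for col in zip(*logits)]
    t2i = sum(cross_entropy_row(columns[k], k) for k in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy data (hypothetical): 8 pairs of 16-dimensional embeddings.
random.seed(0)
images = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
aligned_texts = [[x + 0.01 * random.gauss(0, 1) for x in v] for v in images]
random_texts = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]

loss_aligned = info_nce(images, aligned_texts)
loss_random = info_nce(images, random_texts)
print("aligned pairs:", loss_aligned)
print("random pairs: ", loss_random)
```

In this toy setup the loss is lower when each text embedding lies close to its paired image embedding than when the pairing is random, which is the "align positives, separate negatives" behavior the abstract describes. The grouping effect and the prototype-level objective discussed in the abstract are not modeled here.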