Contrastive Language-Image Pre-training (CLIP) is one of the most effective and scalable methods for training transferable vision models from paired image and text data. CLIP models are trained with a contrastive loss, which typically relies on data augmentations to prevent overfitting and shortcut learning. In the CLIP training paradigm, however, data augmentations are applied exclusively to the image inputs, while the language inputs remain unchanged throughout training, which limits the diversity of texts that each image is exposed to. In this paper, we introduce Language augmented CLIP (LaCLIP), a simple yet highly effective approach that enhances CLIP training through language rewrites. Leveraging the in-context learning capability of large language models, we rewrite the text description associated with each image. The rewritten texts are diverse in sentence structure and vocabulary while preserving the original key concepts and meanings. During training, LaCLIP randomly selects either the original text or one of its rewritten versions as the text augmentation for each image. Extensive experiments on the CC3M, CC12M, RedCaps, and LAION-400M datasets show that CLIP pre-training with language rewrites significantly improves transfer performance without additional computation or memory overhead during training. Specifically, in ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M. Code is available at https://github.com/LijieFan/LaCLIP.
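The text augmentation step described above can be illustrated with a minimal sketch. It assumes the rewrites for each caption have already been generated offline and that the choice among the original and its rewrites is uniform; the function name `sample_caption` and the uniform sampling are illustrative assumptions, not taken from the released code.

```python
import random
from typing import Optional, Sequence


def sample_caption(
    original: str,
    rewrites: Sequence[str],
    rng: Optional[random.Random] = None,
) -> str:
    """Return the original caption or one of its rewrites.

    Assumption: the choice is uniform over the original and all
    precomputed rewrites. The abstract only states that the
    selection is random.
    """
    rng = rng or random  # fall back to the module-level generator
    candidates = [original, *rewrites]
    return rng.choice(candidates)


if __name__ == "__main__":
    # Hypothetical caption and rewrites, for illustration only.
    original = "a dog on the grass"
    rewrites = [
        "a puppy running across a green lawn",
        "grass field with a dog playing on it",
    ]
    # Each training step would draw a fresh caption for the same image.
    for _ in range(3):
        print(sample_caption(original, rewrites))
```

Because the rewrites are computed once before training, the per-step cost in this sketch is a single random draw, which is in line with the claim of no extra computation or memory overhead during training.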