Contrastive Language-Image Pre-training (CLIP) is one of the most effective and scalable methods for training transferable vision models from paired image and text data. CLIP models are trained with a contrastive loss, which typically relies on data augmentations to prevent overfitting and shortcut learning. However, in the CLIP training paradigm, data augmentations are applied exclusively to image inputs, while language inputs remain unchanged throughout training, which limits the diversity of texts each image is exposed to. In this paper, we introduce Language augmented CLIP (LaCLIP), a simple yet highly effective approach that enhances CLIP training through language rewrites. Leveraging the in-context learning capability of large language models, we rewrite the text description associated with each image. The rewritten texts are diverse in sentence structure and vocabulary while preserving the original key concepts and meanings. During training, LaCLIP randomly selects either the original text or one of its rewritten versions as the text augmentation for each image. Extensive experiments on the CC3M, CC12M, RedCaps and LAION-400M datasets show that CLIP pre-training with language rewrites significantly improves transfer performance without additional computation or memory overhead during training. Specifically, for ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M. Code is available at https://github.com/LijieFan/LaCLIP.
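The text augmentation step described in the abstract can be illustrated with a minimal sketch. It is not taken from the released code: the names `sample_caption` and `TextAugmentedPairs` are hypothetical, the rewrites are assumed to have been generated offline and stored per image, and uniform sampling over the original and its rewrites is an assumption, since the abstract only says the choice is random.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple


def sample_caption(original: str, rewrites: Sequence[str],
                   rng: Optional[random.Random] = None) -> str:
    """Pick one caption among the original and its rewrites.

    Uniform sampling is an assumption made for this sketch.
    """
    rng = rng or random
    candidates = [original, *rewrites]
    return rng.choice(candidates)


class TextAugmentedPairs:
    """Hypothetical image-text dataset that re-samples the caption on every access.

    Images are represented by string ids to keep the sketch dependency-free;
    a real pipeline would load and augment the image here as well.
    """

    def __init__(self, image_ids: List[str], originals: List[str],
                 rewrites: Dict[str, List[str]], seed: int = 0) -> None:
        assert len(image_ids) == len(originals)
        self.image_ids = image_ids
        self.originals = originals
        self.rewrites = rewrites  # image id -> precomputed rewritten captions
        self.rng = random.Random(seed)

    def __len__(self) -> int:
        return len(self.image_ids)

    def __getitem__(self, idx: int) -> Tuple[str, str]:
        image_id = self.image_ids[idx]
        caption = sample_caption(
            self.originals[idx], self.rewrites.get(image_id, []), self.rng
        )
        return image_id, caption
```

Because the rewrites are produced before training, each step in this sketch only performs a lookup and a random choice, so the same image can be paired with a different text in each epoch while the contrastive objective itself stays unchanged.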