Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks by scaling up the dataset with image-text pairs collected from the web. However, the intrinsic noise and unmatched image-text pairs in web data can potentially harm the performance of representation learning. To address this issue, we first utilize the OFA model to generate synthetic captions that focus on the image content. The generated captions contain complementary information that is beneficial for pre-training. Then, we propose Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both the raw text and the synthetic caption. As the core components of ALIP, the Language Consistency Gate (LCG) and the Description Consistency Gate (DCG) dynamically adjust the weights of samples and image-text/caption pairs during training. Meanwhile, the adaptive contrastive loss effectively reduces the impact of noisy data and improves the efficiency with which the pre-training data is used. We validate ALIP with experiments on models and pre-training datasets of different scales. Experimental results show that ALIP achieves state-of-the-art performance on multiple downstream tasks, including zero-shot image-text retrieval and linear probe. To facilitate future research, the code and pre-trained models are released at https://github.com/deepglint/ALIP.
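To make the idea of down-weighting noisy pairs concrete, the following is a minimal sketch of a per-sample weighted contrastive (InfoNCE-style) loss. It is not the ALIP implementation: the abstract does not give the form of the LCG/DCG weights or of the adaptive loss, so the function name `weighted_contrastive_loss`, the `weights` argument, the temperature value, and the normalization by the sum of weights are all assumptions made for illustration.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def weighted_contrastive_loss(img, txt, weights, temperature=0.07):
    """Symmetric contrastive loss with one weight per image-text pair.

    img, txt : (N, D) arrays of embeddings; row i of each forms a pair.
    weights  : (N,) non-negative array with a positive sum. These are
               hypothetical stand-ins for gate outputs, not ALIP's
               actual LCG/DCG values.
    """
    # L2-normalize so that dot products are cosine similarities.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)

    logits = img @ txt.T / temperature           # (N, N) similarity matrix
    idx = np.arange(len(img))

    loss_i2t = -log_softmax(logits, axis=1)[idx, idx]   # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx]   # text -> image
    per_sample = 0.5 * (loss_i2t + loss_t2i)

    # Assumed normalization: weighted average over the batch.
    weights = np.asarray(weights, dtype=float)
    return float((weights * per_sample).sum() / weights.sum())


# Toy batch: three matched pairs and one deliberately mismatched pair.
img = np.eye(4)
txt = np.eye(4)
txt[3] = -txt[3]

uniform = weighted_contrastive_loss(img, txt, np.ones(4))
gated = weighted_contrastive_loss(img, txt, np.array([1.0, 1.0, 1.0, 0.0]))
print(uniform, gated)
```

In this toy batch, setting the weight of the mismatched pair to zero removes its contribution, so the gated loss is lower than the uniformly weighted one. In a bi-path setting, one plausible arrangement would be to evaluate such a loss once for the image and raw-text path and once for the image and synthetic-caption path, each with its own weights; how ALIP actually combines the two paths is not stated in the abstract.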