Vision-language pre-training methods such as CLIP have shown promising performance on downstream tasks such as zero-shot image classification and image-text retrieval. Most existing CLIP-like works adopt relatively large image encoders such as ResNet50 and ViT, while lightweight counterparts are rarely discussed. In this paper, we propose a multi-level interaction paradigm for training lightweight CLIP models. First, to mitigate the problem that some image-text pairs are not in strict one-to-one correspondence, we improve the conventional global instance-level alignment objective by progressively softening the labels of negative samples. Second, we introduce a token-level alignment objective based on relaxed bipartite matching for finer-grained alignment between image patches and textual words. Moreover, based on the observation that the accuracy of a CLIP model does not increase correspondingly as the number of text-encoder parameters increases, we add a masked language modeling (MLM) objective to maximize the potential of the shortened text encoder. In practice, we propose an auxiliary fusion module that injects unmasked image embeddings into masked text embeddings at different network stages to enhance the MLM objective. Extensive experiments show that, without introducing additional computational cost during inference, the proposed method achieves higher performance on multiple downstream tasks.
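To make the first objective concrete, the following is a minimal sketch of an instance-level contrastive loss whose negative labels are softened progressively. The abstract does not give the exact formulation, so everything specific here is an assumption made for illustration: the uniform spreading of the softened mass over negatives, the linear schedule, and the names `soft_targets`, `softened_contrastive_loss`, `alpha_schedule`, `alpha_max`, and `temperature` are all hypothetical and may differ from the method in the paper.

```python
import numpy as np


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def soft_targets(batch_size, alpha):
    """Target matrix: 1 - alpha on the matched pair, alpha spread evenly
    over the batch_size - 1 negatives (assumed softening rule)."""
    if batch_size < 2:
        raise ValueError("need at least two pairs to have negatives")
    targets = np.full((batch_size, batch_size), alpha / (batch_size - 1))
    np.fill_diagonal(targets, 1.0 - alpha)
    return targets


def softened_contrastive_loss(img, txt, alpha, temperature=0.07):
    """Symmetric image-text contrastive loss with softened negative labels.

    img, txt: arrays of shape (batch, dim); row i of each is a matched pair.
    alpha = 0 recovers the usual one-hot (hard-label) contrastive loss.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    targets = soft_targets(len(img), alpha)  # symmetric, so reused for both directions
    loss_i2t = -(targets * log_softmax(logits, axis=1)).sum(axis=1).mean()
    loss_t2i = -(targets * log_softmax(logits.T, axis=1)).sum(axis=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)


def alpha_schedule(step, total_steps, alpha_max=0.2):
    """Assumed linear schedule: labels start hard and soften over training."""
    return alpha_max * min(step / total_steps, 1.0)
```

In a training loop, `alpha_schedule(step, total_steps)` would be evaluated at each step and passed to `softened_contrastive_loss`, so that early training uses hard labels and later training tolerates image-text pairs that are not strictly one-to-one.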