Contrastive pre-training on image-text pairs, exemplified by CLIP, has become a standard technique for learning multi-modal vision-language representations. Although CLIP has demonstrated remarkable performance, training it from scratch on noisy web-scale datasets is computationally demanding. In contrast, mask-then-predict pre-training approaches, such as Masked Image Modeling (MIM), offer efficient self-supervised learning of single-modal representations. This paper introduces Unmasked Token Alignment (UTA), a method that leverages existing CLIP models to further enhance their vision-language representations. UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens to the corresponding image tokens of a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder. The pre-trained ViT can be directly applied to zero-shot evaluation even without training on image-text pairs. Compared to MIM approaches, UTA does not suffer from training-finetuning inconsistency and is substantially more training-efficient because it avoids the extra [MASK] tokens. Extensive experimental results demonstrate that UTA can enhance CLIP models and outperform existing MIM methods on various uni- and multi-modal benchmarks. Code and models are available at https://github.com/jihaonew/UTA.
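The core idea stated above, supervising only the unmasked student tokens against the matching tokens of a frozen CLIP vision encoder, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `uta_alignment_loss`, the cosine-distance loss form, the masking ratio, and the toy token shapes are all assumptions for exposition; the actual UTA objective and training details are specified in the paper and repository.

```python
import numpy as np

rng = np.random.default_rng(0)

def uta_alignment_loss(student_tokens, teacher_tokens, mask):
    """Align unmasked student tokens to frozen-teacher tokens.

    student_tokens: (N, D) ViT outputs at all patch positions.
    teacher_tokens: (N, D) tokens from the frozen CLIP vision encoder.
    mask:           (N,) boolean; True marks masked positions, which
                    receive NO supervision (unlike MIM, no [MASK] token
                    is predicted -- only visible tokens are aligned).
    """
    keep = ~mask                      # supervise unmasked positions only
    s = student_tokens[keep]
    t = teacher_tokens[keep]
    # Cosine-distance alignment (one plausible choice of loss).
    s = s / np.linalg.norm(s, axis=-1, keepdims=True)
    t = t / np.linalg.norm(t, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

# Toy example: 196 patch tokens (14x14 grid) of dimension 64.
N, D = 196, 64
student = rng.standard_normal((N, D))
teacher = rng.standard_normal((N, D))   # stands in for frozen CLIP outputs
mask = rng.random(N) < 0.5              # hypothetical 50% masking ratio
loss = uta_alignment_loss(student, teacher, mask)
```

Because the teacher is the CLIP vision encoder, a student trained this way inherits alignment with the CLIP text encoder for free, which is why the abstract notes that zero-shot evaluation works without any image-text training.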