CLIP is a seminal multimodal model that maps images and text into a shared representation space through contrastive learning on hundreds of millions of image-caption pairs. Inspired by the rapid progress of large language models (LLMs), we investigate how the strong linguistic understanding and broad world knowledge of LLMs can further strengthen CLIP, particularly for long and complex captions. We introduce an efficient fine-tuning framework that embeds an LLM into a pretrained CLIP at nearly the same training cost as standard CLIP fine-tuning. Our method first converts the LLM into an embedding-compatible form for the CLIP setting, then couples it with the pretrained CLIP vision encoder through a lightweight adaptor trained on only a few million image-caption pairs. With this strategy, we achieve large performance gains without large-scale retraining, outperforming state-of-the-art CLIP variants such as EVA02 and SigLIP-2. The LLM-enhanced CLIP delivers consistent improvements across a wide range of downstream tasks, including linear-probe classification; zero-shot image-text retrieval with both short and long captions, in English and other languages; zero-shot and supervised image segmentation; object detection; and serving as the vision backbone for multimodal large-model benchmarks. Code and models are available at: https://aka.ms/llm2clip
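To make the coupling concrete, the sketch below shows the standard CLIP-style symmetric contrastive objective with a lightweight linear adaptor projecting (frozen) LLM caption embeddings into the vision encoder's space. This is a minimal NumPy illustration of the general recipe only; the function names, the linear form of the adaptor, and all dimensions are our own assumptions, not the paper's implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # project embeddings onto the unit sphere before computing cosine similarity
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def adaptor(text_emb, W, b):
    # hypothetical lightweight adaptor: a single trainable linear map from the
    # frozen LLM embedding space into the CLIP joint space
    return text_emb @ W + b

def symmetric_clip_loss(img_emb, txt_emb, temperature=0.07):
    # standard CLIP-style symmetric InfoNCE over cosine similarities;
    # matched image-caption pairs sit on the diagonal of the logit matrix
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature           # (N, N) similarity matrix
    labels = np.arange(len(logits))

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)     # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(l)), labels].mean()

    # average the image-to-text and text-to-image cross-entropies
    return 0.5 * (xent(logits) + xent(logits.T))

# toy batch: 4 image-caption pairs, LLM dim 16 projected to CLIP dim 8
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(4, 8))
llm_emb = rng.normal(size=(4, 16))
W, b = rng.normal(size=(16, 8)) * 0.1, np.zeros(8)
loss = symmetric_clip_loss(img_emb, adaptor(llm_emb, W, b))
print(float(loss))
```

In training, only `W` and `b` (and, per the abstract, the CLIP fine-tuning itself) would receive gradients, which is why the method stays close to the cost of standard CLIP fine-tuning.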