The original CLIP text encoder is limited to a maximum input length of 77 tokens, which hampers its ability to process long texts and perform fine-grained semantic understanding. In addition, the CLIP text encoder lacks support for multilingual inputs. These limitations significantly restrict its applicability across a broader range of tasks. Recent studies have attempted to replace the CLIP text encoder with an LLM-based embedder to improve long-text processing, multilingual understanding, and fine-grained semantic comprehension. However, because the representation space of LLMs and the vision-language space of CLIP are pretrained independently without alignment priors, direct alignment via contrastive learning can disrupt the intrinsic vision-language alignment in the CLIP image encoder, leading to underutilization of the knowledge acquired during pre-training. To address this challenge, we propose ProCLIP, a curriculum learning-based progressive vision-language alignment framework that effectively aligns the CLIP image encoder with an LLM-based embedder. Specifically, ProCLIP first distills knowledge from CLIP's text encoder into the LLM-based embedder to leverage CLIP's rich pretrained knowledge while establishing initial alignment between the LLM embedder and the CLIP image encoder. Subsequently, ProCLIP further aligns the CLIP image encoder with the LLM-based embedder through image-text contrastive tuning, employing self-distillation regularization to avoid overfitting. To achieve a more effective alignment, an instance semantic alignment loss and an embedding structure alignment loss are employed during representation inheritance and contrastive tuning. The code is available at https://github.com/VisionXLab/ProCLIP.
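The sketch below illustrates the two training stages described above with plausible loss formulations. It is a minimal, hypothetical rendering: the specific choices (cosine distance for instance semantic alignment, pairwise-similarity matching for embedding structure alignment, symmetric InfoNCE for contrastive tuning, and an L2 penalty toward a frozen encoder copy for self-distillation) are assumptions for illustration and may differ from the exact formulations used in ProCLIP.

```python
import torch
import torch.nn.functional as F

# Hypothetical loss sketches; the paper's exact formulations may differ.

def instance_semantic_alignment_loss(student_emb, teacher_emb):
    """Pull each LLM-embedder text embedding toward the matching
    CLIP text-encoder embedding (per-instance cosine distance)."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()

def embedding_structure_alignment_loss(student_emb, teacher_emb):
    """Match the pairwise similarity structure of the student batch to that
    of the teacher batch, so relational geometry is inherited as well."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    return F.mse_loss(s @ s.T, t @ t.T)

def clip_contrastive_loss(image_emb, text_emb, logit_scale):
    """Standard symmetric InfoNCE over matched image-text pairs in a batch."""
    i = F.normalize(image_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = logit_scale * i @ t.T
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

def self_distillation_regularizer(current_emb, frozen_emb):
    """Keep the tuned image encoder close to a frozen copy of itself,
    limiting drift away from pretrained knowledge."""
    return F.mse_loss(F.normalize(current_emb, dim=-1), F.normalize(frozen_emb, dim=-1))

# Toy usage with random features standing in for real encoder outputs.
if __name__ == "__main__":
    B, D = 8, 512
    llm_text = torch.randn(B, D, requires_grad=True)  # LLM-based embedder output
    clip_text = torch.randn(B, D)                     # frozen CLIP text encoder output
    img = torch.randn(B, D, requires_grad=True)       # CLIP image encoder output (tuned)
    img_frozen = torch.randn(B, D)                    # frozen copy of the image encoder

    # Stage 1: representation inheritance (distill the CLIP text space into the LLM embedder).
    stage1 = (instance_semantic_alignment_loss(llm_text, clip_text)
              + embedding_structure_alignment_loss(llm_text, clip_text))

    # Stage 2: image-text contrastive tuning with self-distillation regularization
    # (the 0.1 weight is an arbitrary placeholder, not a value from the paper).
    stage2 = (clip_contrastive_loss(img, llm_text, logit_scale=100.0)
              + 0.1 * self_distillation_regularizer(img, img_frozen))

    print(stage1.item(), stage2.item())
```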