Recently, CLIP has emerged as a valuable model for aligning image and text information in multi-modal scenarios. However, researchers have observed that CLIP's text and image encoders are limited in their ability to extract detailed knowledge from caption-image pairs. In response, this paper introduces KKLIP, a novel approach that enhances CLIP by incorporating a new knowledge distillation (KD) method derived from Llama 2. Our method comprises three objectives: Text Embedding Distillation, Concept Learning, and Contrastive Learning. First, Text Embedding Distillation trains the KKLIP text encoder to emulate the teacher model, Llama 2. Second, Concept Learning assigns a soft concept label to each caption-image pair via offline k-means clustering of text representations from Llama 2, allowing KKLIP to learn from these soft concept labels. Finally, Contrastive Learning aligns the text and image embeddings. Our experimental results demonstrate that KKLIP improves the quality of both the text and image encoders.
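The three objectives can be viewed as a sum of three losses. Below is a minimal numpy sketch of that combination, purely illustrative and not the paper's implementation: the MSE form of the distillation term, the concept classifier head, and all dimensions and weightings are assumptions, and the projection between the Llama 2 and KKLIP embedding spaces is omitted.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def distillation_loss(student_txt, teacher_txt):
    # Text Embedding Distillation: the student text encoder mimics the
    # teacher (Llama 2) embeddings. MSE is an assumed choice of loss.
    return np.mean((student_txt - teacher_txt) ** 2)

def concept_loss(concept_logits, soft_labels):
    # Concept Learning: cross-entropy against soft concept labels obtained
    # from offline k-means clustering of teacher text embeddings.
    log_probs = concept_logits - np.log(
        np.sum(np.exp(concept_logits), axis=1, keepdims=True))
    return -np.mean(np.sum(soft_labels * log_probs, axis=1))

def contrastive_loss(txt, img, temperature=0.07):
    # CLIP-style symmetric InfoNCE: matched caption-image pairs lie on the
    # diagonal of the similarity matrix.
    txt, img = l2_normalize(txt), l2_normalize(img)
    logits = txt @ img.T / temperature
    n = logits.shape[0]

    def ce(lg):
        lp = lg - np.log(np.sum(np.exp(lg), axis=1, keepdims=True))
        return -np.mean(lp[np.arange(n), np.arange(n)])

    return 0.5 * (ce(logits) + ce(logits.T))

# Toy batch: 4 pairs, 8-dim embeddings, 3 k-means concepts (all assumed).
rng = np.random.default_rng(0)
B, D, K = 4, 8, 3
student_txt = rng.normal(size=(B, D))
teacher_txt = rng.normal(size=(B, D))
img_emb = rng.normal(size=(B, D))
concept_logits = rng.normal(size=(B, K))
soft_labels = np.full((B, K), 1.0 / K)  # uniform soft assignment as a stand-in

total = (distillation_loss(student_txt, teacher_txt)
         + concept_loss(concept_logits, soft_labels)
         + contrastive_loss(student_txt, img_emb))
print(float(total))
```

In practice the three terms would be weighted and minimized jointly; equal weighting here is only for illustration.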