Contrastive Language-Image Pre-training (CLIP) has become a promising framework for language-supervised visual pre-training. This paper aims to distill small CLIP models under the supervision of a large teacher CLIP model. We propose several distillation strategies, covering relation, feature, gradient, and contrastive paradigms, to examine the effectiveness of CLIP-Knowledge Distillation (CLIP-KD). We show that simple feature mimicry with a Mean Squared Error (MSE) loss works surprisingly well. Moreover, interactive contrastive learning across teacher and student encoders also improves performance. We explain that the success of CLIP-KD can be attributed to maximizing the feature similarity between teacher and student. The unified method is applied to distill several student models trained on CC3M+12M. CLIP-KD consistently improves student CLIP models on zero-shot ImageNet classification and cross-modal retrieval benchmarks. With a ViT-L/14 teacher pretrained on LAION-400M, CLIP-KD achieves 57.5\% and 55.4\% zero-shot top-1 ImageNet accuracy with ViT-B/16 and ResNet-50 students, surpassing the original CLIP without KD by margins of 20.5\% and 20.1\%, respectively. Our code is released at https://github.com/winycg/CLIP-KD.
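To make two of the strategies named above concrete, the following is a minimal NumPy sketch of feature mimicry with an MSE loss and of an interactive contrastive loss that pairs student embeddings with teacher embeddings of the other modality. It is an illustration under stated assumptions, not the released implementation: the function names are hypothetical, the student and teacher embeddings are assumed to share the same dimension (otherwise a projection layer would be needed), and the temperature of 0.07 is an assumed value rather than one taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def feature_mimicry_loss(student_feats, teacher_feats):
    """Mean squared error between student and teacher embeddings.

    Assumes both arrays have the same shape (batch, dim).
    """
    return float(np.mean((student_feats - teacher_feats) ** 2))


def _diagonal_cross_entropy(logits):
    """Softmax cross-entropy where row i's positive is column i."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def interactive_contrastive_loss(student_img, student_txt,
                                 teacher_img, teacher_txt,
                                 temperature=0.07):
    """Contrast student embeddings against teacher embeddings.

    Student image features are matched to teacher text features, and
    student text features to teacher image features. The temperature
    is an assumed hyperparameter.
    """
    s_i, s_t = l2_normalize(student_img), l2_normalize(student_txt)
    t_i, t_t = l2_normalize(teacher_img), l2_normalize(teacher_txt)
    img_to_txt = _diagonal_cross_entropy(s_i @ t_t.T / temperature)
    txt_to_img = _diagonal_cross_entropy(s_t @ t_i.T / temperature)
    return 0.5 * (img_to_txt + txt_to_img)


# Toy usage with random stand-ins for encoder outputs (batch of 8, dim 32).
rng = np.random.default_rng(0)
teacher_img = rng.normal(size=(8, 32))
teacher_txt = rng.normal(size=(8, 32))
student_img = teacher_img + 0.1 * rng.normal(size=(8, 32))
student_txt = teacher_txt + 0.1 * rng.normal(size=(8, 32))

mimicry = (feature_mimicry_loss(student_img, teacher_img)
           + feature_mimicry_loss(student_txt, teacher_txt))
contrastive = interactive_contrastive_loss(
    student_img, student_txt, teacher_img, teacher_txt)
print("feature mimicry loss:", mimicry)
print("interactive contrastive loss:", contrastive)
```

For unit-normalized embeddings, the squared distance between a student feature and a teacher feature equals 2 minus twice their cosine similarity, so minimizing the MSE term in that setting is the same as maximizing teacher-student feature similarity, which is the explanation the abstract gives for why CLIP-KD works.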