Multi-modal foundation models such as CLIP have shown impressive zero-shot capabilities. However, their large parameter counts and high inference time limit their applicability in resource-constrained environments. While existing approaches scale down the entire CLIP architecture, we focus on training smaller variants of the image encoder alone, which suffices for efficient zero-shot classification. Synthetic data has shown promise for distilling representations from larger teachers, yielding strong few-shot and linear-probe performance. Surprisingly, we find that this approach fails in true zero-shot settings when contrastive losses are used. We identify the exploitation of spurious features as the cause of poor generalization between synthetic and real data. By using an image-feature-based L2 distillation loss instead, we mitigate these problems and train students whose zero-shot performance on four domain-specific datasets is on par with a ViT-B/32 teacher model trained on DataCompXL, while having up to 92% fewer parameters.
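To make the idea of an image-feature-based L2 distillation loss concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that student and teacher image embeddings share the same dimensionality, that both are non-zero and L2-normalized before comparison, and that zero-shot prediction reuses class text embeddings from the teacher's text encoder. The function names `l2_normalize`, `l2_feature_distillation_loss`, and `zero_shot_predict` are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def l2_feature_distillation_loss(student_feats, teacher_feats):
    """Mean squared L2 distance between paired image embeddings.

    Assumption: both embeddings are normalized first; the source does not
    state whether normalization is applied.
    """
    total = 0.0
    for s, t in zip(student_feats, teacher_feats):
        s, t = l2_normalize(s), l2_normalize(t)
        total += sum((a - b) ** 2 for a, b in zip(s, t))
    return total / len(student_feats)


def zero_shot_predict(image_feat, class_text_feats):
    """Return the index of the class whose text embedding is most similar.

    Assumption: class_text_feats come from the teacher's text encoder, so
    only the image encoder has to be replaced by the smaller student.
    """
    img = l2_normalize(image_feat)
    sims = [
        sum(a * b for a, b in zip(img, l2_normalize(t)))
        for t in class_text_feats
    ]
    return max(range(len(sims)), key=sims.__getitem__)


# Toy usage with 2-dimensional embeddings.
student = [[1.0, 0.0], [0.6, 0.8]]
teacher = [[2.0, 0.0], [0.0, 1.0]]
loss = l2_feature_distillation_loss(student, teacher)
label = zero_shot_predict([0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]])
```

Unlike a contrastive loss, this objective compares each student embedding only with the teacher embedding of the same image, so it does not depend on text captions or on the other samples in the batch. In practice the embeddings would come from batched encoder outputs in a deep learning framework rather than from Python lists.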