Large Vision-Language Foundation Models (VLFMs), such as CLIP, ALIGN and Florence, are trained on large-scale datasets of image-caption pairs and achieve superior transferability and robustness on downstream tasks, but their large size, high latency and fixed architectures make them difficult to use in many practical applications. Unfortunately, recent work shows that training a small custom VLFM for resource-limited applications on public, smaller-scale data remains very difficult. In this paper, we introduce a new distillation mechanism (DIME-FM) that transfers the knowledge contained in large VLFMs to smaller, customized foundation models using a relatively small amount of inexpensive, unpaired images and sentences. We transfer the knowledge from the pre-trained CLIP-ViT-L/14 model to a ViT-B/32 model, using only 40M public images and 28.4M unpaired public sentences. The resulting model, "Distill-ViT-B/32", rivals the CLIP-ViT-B/32 model pre-trained on the private WiT dataset (400M image-text pairs): Distill-ViT-B/32 achieves similar zero-shot and linear-probing performance on both the ImageNet and ELEVATER (20 image classification tasks) benchmarks. It also displays comparable robustness when evaluated on five datasets with natural distribution shifts from ImageNet.
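The abstract does not spell out the distillation objective, so the following is only a minimal sketch of one plausible way to distill from unpaired images and sentences. It assumes the student is trained to match the teacher's image-to-sentence similarity distribution under a KL divergence. The function names, the temperature value, and the random stand-in embeddings are all hypothetical and are not taken from DIME-FM itself.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each embedding to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def similarity_distillation_loss(teacher_img, teacher_txt,
                                 student_img, student_txt,
                                 temperature=0.07):
    """Mean KL(teacher || student) over image-to-sentence similarity distributions.

    Assumption (not from the source): every image is scored against every
    sentence in the batch, so no image-caption pairing is required. Teacher
    and student may use different embedding widths, because only the
    (num_images x num_sentences) similarity matrices are compared.
    """
    t_logits = l2_normalize(teacher_img) @ l2_normalize(teacher_txt).T / temperature
    s_logits = l2_normalize(student_img) @ l2_normalize(student_txt).T / temperature
    t_logp = log_softmax(t_logits)
    s_logp = log_softmax(s_logits)
    kl_per_image = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1)
    return float(kl_per_image.mean())


# Random stand-ins for encoder outputs: 8 images, 16 unrelated sentences.
rng = np.random.default_rng(0)
teacher_img = rng.normal(size=(8, 768))   # hypothetical teacher embedding width
teacher_txt = rng.normal(size=(16, 768))
student_img = rng.normal(size=(8, 512))   # hypothetical student embedding width
student_txt = rng.normal(size=(16, 512))

loss = similarity_distillation_loss(teacher_img, teacher_txt,
                                    student_img, student_txt)
print(f"distillation loss on random embeddings: {loss:.4f}")
```

In a real training loop the embeddings would come from the teacher and student encoders, and the loss would be minimized with respect to the student's parameters only. The point of the sketch is that the images and sentences in a batch need not be captions of one another.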