Vision-Language Models (VLMs) such as CLIP are trained on large collections of image-text pairs, resulting in remarkable generalization across several data distributions. The prohibitive cost of training these models, and of collecting and curating their data, makes them valuable Intellectual Property (IP) for organizations. This motivates a vendor-client paradigm, where a vendor trains a large-scale VLM and grants clients only input-output access on a pay-per-query basis in a black-box setting. The client aims to minimize inference cost by distilling the VLM to a student model using the limited available task-specific data, and then deploying this student model in the downstream application. While naive distillation substantially improves the In-Domain (ID) accuracy of the student, it fails to transfer the superior out-of-distribution (OOD) generalization of the VLM teacher when only a limited number of labeled images is available. To mitigate this, we propose Vision-Language to Vision-Align, Distill, Predict (VL2V-ADiP), which first aligns the vision and language modalities of the teacher model with the vision modality of a pre-trained student model, and then distills the aligned VLM embeddings to the student. This maximally retains the pre-trained features of the student, while also incorporating the rich representations of the VLM image encoder and the superior generalization of the text embeddings. The proposed approach achieves state-of-the-art results on the standard Domain Generalization benchmarks in a black-box teacher setting, and also when the weights of the VLM are accessible.
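To make the idea of aligning a student embedding with both teacher modalities concrete, the following is a minimal sketch and not the paper's implementation. It assumes that the black-box teacher returns an image embedding and a class-text embedding per query, that the student embedding has already been projected to the same dimension, and that the two alignment terms are weighted equally. The function names `cosine`, `align_loss`, and `predict` are hypothetical, and classifying by nearest text embedding is only one plausible reading of the "Predict" step.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def align_loss(student_emb, teacher_img_emb, teacher_txt_emb):
    """Hypothetical alignment loss (an assumption, not the paper's exact form).

    Pulls the student embedding towards both the teacher's image embedding
    and the teacher's text embedding of the ground-truth class, with equal
    weights on the two terms.
    """
    loss_img = 1.0 - cosine(student_emb, teacher_img_emb)
    loss_txt = 1.0 - cosine(student_emb, teacher_txt_emb)
    return 0.5 * (loss_img + loss_txt)


def predict(student_emb, class_txt_embs):
    """Assumed prediction rule: index of the most similar class-text embedding."""
    scores = [cosine(student_emb, t) for t in class_txt_embs]
    return max(range(len(scores)), key=scores.__getitem__)
```

In this sketch the teacher is only ever queried for embeddings, which matches the input-output access described above; no teacher weights or gradients are needed.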