Prompt learning has emerged as a valuable technique for adapting vision-language models (VLMs) such as CLIP to downstream tasks in specific domains. Existing work focuses mainly on designing different forms of prompts and neglects their potential as effective distillers for learning from larger teacher models. In this paper, we introduce an unsupervised domain prompt distillation framework that transfers the knowledge of a larger teacher model to a lightweight target (student) model through prompt-driven imitation on unlabeled domain images. The framework consists of two stages. In the first stage, we pre-train a large CLIP teacher model using domain (few-shot) labels. After pre-training, we exploit the decoupled-modality design of CLIP by computing the text features once with the teacher text encoder and storing them as class vectors. In the second stage, the stored class vectors are shared by the teacher and student image encoders to compute the predicted logits. We then align the logits of the teacher and student models via KL divergence, encouraging the student image encoder, through its learnable prompts, to produce probability distributions similar to the teacher's. This prompt distillation process removes the reliance on labeled data, so the algorithm can leverage a large amount of unlabeled images within the domain. Finally, the trained student image encoder and the pre-stored text features (class vectors) are used for inference. To the best of our knowledge, we are the first to (1) perform unsupervised domain-specific prompt-driven knowledge distillation for CLIP, and (2) establish a practical mechanism for pre-storing text features as class vectors shared between teacher and student. Extensive experiments on 11 datasets demonstrate the effectiveness of our method.
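The second-stage objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy feature vectors, the `scale` value, and the assumption that teacher and student image features have the same dimension as the stored class vectors are all hypothetical choices made for illustration. The sketch only computes the KL loss between the teacher and student distributions over shared class vectors; in practice the loss would be minimized with respect to the student's learnable prompts using an autodiff framework.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_logits(image_feature, class_vectors, scale):
    """Cosine similarity between one image feature and each stored class vector."""
    img = l2_normalize(image_feature)
    return [scale * sum(a * b for a, b in zip(img, c)) for c in class_vectors]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_feat, student_feat, class_vectors, scale=10.0):
    """KL divergence from the teacher's class distribution to the student's.

    Both distributions are computed against the SAME pre-stored class vectors,
    so no text encoder is needed at this stage and no labels are used.
    """
    p_teacher = softmax(class_logits(teacher_feat, class_vectors, scale))
    p_student = softmax(class_logits(student_feat, class_vectors, scale))
    return kl_divergence(p_teacher, p_student)


# Stand-in for stage 1: class vectors computed once by the teacher text
# encoder and stored (toy 3-dimensional values, one row per class).
class_vectors = [l2_normalize(v) for v in [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]]

# Stand-ins for the two image encoders' outputs on one unlabeled image.
teacher_feat = [0.9, 0.1, 0.2]
student_feat = [0.5, 0.4, 0.3]

loss = distillation_loss(teacher_feat, student_feat, class_vectors)
print(f"distillation loss: {loss:.4f}")
```

Because the class vectors are fixed after the first stage, only the image encoders run during distillation and inference, which is the practical benefit of the pre-storing mechanism.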