Prompt learning has emerged as a valuable technique for adapting vision-language models (VLMs) such as CLIP to downstream tasks in specific domains. Existing work mainly focuses on designing different forms of learnable prompts, neglecting the potential of prompts as effective distillers for learning from larger teacher models. In this paper, we introduce an unsupervised domain prompt distillation framework, which transfers the knowledge of a larger teacher model to a lightweight target model through prompt-driven imitation using unlabeled domain images. Specifically, our framework consists of two stages. In the first stage, we pre-train a large CLIP teacher model using domain (few-shot) labels. After pre-training, we exploit the decoupled-modality design of CLIP by computing the text features once with the teacher text encoder and storing them as class vectors. In the second stage, the stored class vectors are shared by the teacher and student image encoders to compute the predicted logits. We then align the logits of the teacher and student models via KL divergence, encouraging the student image encoder to produce probability distributions similar to the teacher's through the learnable prompts. The proposed prompt distillation process eliminates the reliance on labeled data, enabling the algorithm to leverage a vast amount of unlabeled images within the domain. Finally, the trained student image encoder and the pre-stored text features (class vectors) are used for inference. To the best of our knowledge, we are the first to (1) perform unsupervised domain-specific prompt-driven knowledge distillation for CLIP, and (2) establish a practical mechanism for pre-storing text features as class vectors shared between teacher and student. Extensive experiments on 11 datasets demonstrate the effectiveness of our method.
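The second-stage objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that image features and the pre-stored class vectors live in the same embedding space, that logits are scaled cosine similarities (the `scale` value of 100 is an assumed CLIP-style temperature), and that the KL divergence is taken from teacher to student. All function names and the toy vectors are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_logits(image_feature, class_vectors, scale=100.0):
    """Scaled cosine similarity between one image feature and each class vector.

    `class_vectors` are assumed to be unit-length text features that were
    computed once and stored; `scale` is an assumed temperature.
    """
    f = l2_normalize(image_feature)
    return [scale * sum(a * b for a, b in zip(f, c)) for c in class_vectors]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(teacher_feature, student_feature, class_vectors, scale=100.0):
    """KL(teacher || student) over class probabilities; no labels are used."""
    p_teacher = softmax(class_logits(teacher_feature, class_vectors, scale))
    p_student = softmax(class_logits(student_feature, class_vectors, scale))
    return kl_divergence(p_teacher, p_student)


# Toy stand-ins for the stored class vectors (one per class).
class_vectors = [l2_normalize(v) for v in ([1.0, 0.0, 0.0],
                                           [0.0, 1.0, 0.0],
                                           [0.0, 0.0, 1.0])]

# Toy image features for a single unlabeled image.
teacher_feature = [0.9, 0.1, 0.0]
student_feature = [0.1, 0.9, 0.0]

loss = distillation_loss(teacher_feature, student_feature, class_vectors)
print(f"distillation loss: {loss:.4f}")
```

In this sketch the class vectors are built once and reused for both the teacher and the student logits, mirroring the shared pre-stored text features. In a full training loop only the student's prompt parameters would receive gradients from this loss.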