Prompt learning has emerged as a valuable technique for adapting vision-language models (VLMs) such as CLIP to downstream tasks in specific domains. Existing work mainly focuses on designing different forms of prompts and neglects the potential of prompts as effective distillers for learning from larger teacher models. In this paper, we introduce an unsupervised domain prompt distillation framework that transfers the knowledge of a larger teacher model to a lightweight target model through prompt-driven imitation using unlabeled domain images. The framework consists of two stages. In the first stage, we pre-train a large CLIP teacher model using domain (few-shot) labels. After pre-training, we exploit CLIP's decoupled modalities: the teacher text encoder computes the text features only once, and they are stored as class vectors. In the second stage, the stored class vectors are shared by the teacher and student image encoders to compute the predicted logits. We then align the logits of the teacher and student models via KL divergence, encouraging the student image encoder to produce probability distributions similar to the teacher's through the learnable prompts. This prompt distillation process eliminates the reliance on labeled data, so the algorithm can leverage a vast amount of unlabeled images within the domain. Finally, the trained student image encoder and the pre-stored text features (class vectors) are used for inference. To the best of our knowledge, we are the first to (1) perform unsupervised domain-specific prompt-driven knowledge distillation for CLIP, and (2) establish a practical mechanism for pre-storing text features as class vectors shared between teacher and student. Extensive experiments on 11 datasets demonstrate the effectiveness of our method.
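The second-stage objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the temperature parameter, the KL direction (teacher relative to student), and the assumption that teacher and student image features already live in the same embedding space as the stored class vectors are all illustrative choices. Prompt learning and gradient updates are omitted; only the loss on one unlabeled image is shown.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length, as is usual before CLIP-style cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def class_logits(image_feature, class_vectors):
    """Cosine similarity between one image feature and each pre-stored class vector."""
    img = l2_normalize(image_feature)
    return [sum(a * b for a, b in zip(img, l2_normalize(c))) for c in class_vectors]

def softmax(logits, temperature=1.0):
    """Numerically stable softmax; the temperature is an assumed hyperparameter."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_feature, student_feature, class_vectors, temperature=1.0):
    """KL between teacher and student predictions on one unlabeled image.

    Both encoders score the image against the SAME stored class vectors,
    so no label and no text-encoder forward pass is needed here.
    """
    p_teacher = softmax(class_logits(teacher_feature, class_vectors), temperature)
    p_student = softmax(class_logits(student_feature, class_vectors), temperature)
    return kl_divergence(p_teacher, p_student)

# Toy example with two classes and 2-D features (values are made up).
stored_class_vectors = [[1.0, 0.0], [0.0, 1.0]]  # computed once by the teacher text encoder
teacher_feature = [1.0, 0.0]                     # teacher image encoder output
student_feature = [0.0, 1.0]                     # student image encoder output
loss = distillation_loss(teacher_feature, student_feature, stored_class_vectors)
```

In this sketch the loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is the signal that would drive updates to the student's learnable prompts.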