Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning

Fine-tuning vision-language models (VLMs) like CLIP on downstream tasks is often necessary to optimize their performance. However, a major obstacle is the limited availability of labeled data. We study the use of pseudolabels, i.e., heuristic labels for unlabeled data, to enhance CLIP via prompt tuning. Conventional pseudolabeling trains a model on labeled data and then uses it to generate labels for unlabeled data. VLMs' zero-shot capabilities enable a "second generation" of pseudolabeling approaches that do not require task-specific training on labeled data. By using zero-shot pseudolabels as a source of supervision, we observe that learning paradigms such as semi-supervised, transductive zero-shot, and unsupervised learning can all be seen as optimizing the same loss function. This unified view enables the development of versatile training strategies that apply across learning paradigms. We investigate these strategies on image classification tasks where CLIP exhibits limitations, varying both the prompt modality (e.g., textual or visual prompts) and the learning paradigm. We find that (1) previously unexplored prompt tuning strategies that iteratively refine pseudolabels consistently improve CLIP accuracy, by 19.5 points in semi-supervised learning, by 28.4 points in transductive zero-shot learning, and by 15.2 points in unsupervised learning, and (2) unlike conventional semi-supervised pseudolabeling, which exacerbates model biases toward classes with higher-quality pseudolabels, prompt tuning leads to a more equitable distribution of per-class accuracy. The code to reproduce the experiments is at github.com/BatsResearch/menghini-enhanceCLIPwithCLIP-code.
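
As an illustration of the two ideas above (zero-shot pseudolabels as supervision, and a single loss shared across learning paradigms), the following is a minimal sketch and not the released implementation. Several details are assumptions that the abstract does not state: the embeddings are toy stand-ins for CLIP image and class-prompt embeddings, confidence is taken to be cosine similarity, the `k` most confident images are kept per predicted class, and the loss weights `gamma` and `lam` are hypothetical. All function names are likewise hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_pseudolabels(image_embs, class_embs, k):
    """Assign each image to its most similar class, then keep the k most
    confident images per class (assumed selection rule).

    Returns {class_index: [image indices, most confident first]}.
    """
    per_class = {c: [] for c in range(len(class_embs))}
    for i, img in enumerate(image_embs):
        sims = [cosine(img, cls) for cls in class_embs]
        pred = max(range(len(sims)), key=sims.__getitem__)
        per_class[pred].append((sims[pred], i))
    return {
        c: [i for _, i in sorted(items, reverse=True)[:k]]
        for c, items in per_class.items()
    }


def _mean(values):
    """Mean of a list, defined as 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def unified_loss(labeled_losses, pseudo_losses, gamma=1.0, lam=1.0):
    """Weighted sum of the mean loss on labeled data and the mean loss on
    pseudolabeled data. With no labeled data (e.g. the unsupervised case),
    the first term vanishes and only pseudolabels supervise training.
    """
    return gamma * _mean(labeled_losses) + lam * _mean(pseudo_losses)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for CLIP outputs (assumption).
    class_embs = [[1.0, 0.0], [0.0, 1.0]]
    image_embs = [[0.9, 0.1], [0.6, 0.4], [0.1, 0.9], [0.2, 0.8], [0.99, 0.01]]
    print(zero_shot_pseudolabels(image_embs, class_embs, k=2))
    print(unified_loss([], [0.5, 1.5]))
```

Under this reading, iterative refinement would amount to repeating the selection step with the prompt-tuned model in place of zero-shot CLIP and retraining on the new pseudolabels; that loop is omitted here because its exact schedule is not given in the abstract.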
