Pre-trained vision-language models such as CLIP exhibit strong transferability, yet adapting them to downstream image classification tasks under limited annotation budgets remains challenging. In active learning settings, the model must select the most informative samples for annotation from a large pool of unlabeled data. Existing approaches typically estimate uncertainty via entropy-based criteria or representation clustering, without explicitly modeling uncertainty from the model's perspective. In this work, we propose a robust uncertainty modeling framework for active CLIP adaptation based on dual-prompt tuning. We introduce two learnable prompts in the textual branch of CLIP. The positive prompt enhances the discriminability of the task-specific textual embeddings aligned with the lightly tuned visual embeddings, improving classification reliability. Meanwhile, the negative prompt is trained in a reversed manner to explicitly model the probability that the predicted label is correct, providing a principled uncertainty signal that guides active sample selection. Extensive experiments across different fine-tuning paradigms demonstrate that our method consistently outperforms existing active learning methods under the same annotation budget.
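A minimal sketch of how such a dual-prompt uncertainty score might be computed and used for sample selection, assuming a CLIP-style setup in PyTorch. DualPromptScorer, text_encoder, class_tok, the (p_neg - p_pos) heuristic, and all hyperparameters below are illustrative assumptions, not the paper's implementation:

```python
# Minimal, self-contained sketch (PyTorch) of dual-prompt uncertainty scoring
# for active sample selection. All names, shapes, and the uncertainty
# heuristic below are illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

class DualPromptScorer(torch.nn.Module):
    """Holds a learnable positive and a learnable negative context prompt."""
    def __init__(self, ctx_len: int = 16, ctx_dim: int = 512):
        super().__init__()
        self.pos_ctx = torch.nn.Parameter(0.02 * torch.randn(ctx_len, ctx_dim))
        self.neg_ctx = torch.nn.Parameter(0.02 * torch.randn(ctx_len, ctx_dim))
        self.logit_scale = torch.nn.Parameter(torch.tensor(4.6))  # learnable temperature

    def forward(self, img_feat, class_tok, text_encoder):
        # img_feat: (B, D) features from the lightly tuned visual branch.
        # class_tok: (C, L, D) class-name token embeddings; text_encoder is a
        # stand-in for CLIP's textual branch consuming [context; class tokens].
        pos_txt = F.normalize(text_encoder(self.pos_ctx, class_tok), dim=-1)  # (C, D)
        neg_txt = F.normalize(text_encoder(self.neg_ctx, class_tok), dim=-1)  # (C, D)
        img = F.normalize(img_feat, dim=-1)
        scale = self.logit_scale.exp()
        return scale * img @ pos_txt.t(), scale * img @ neg_txt.t()  # two (B, C) logit maps

def uncertainty_scores(pos_logits, neg_logits):
    # One plausible reading of the abstract: the positive branch gives the
    # prediction, the negative branch estimates how reliable that prediction is.
    pred = pos_logits.argmax(dim=-1, keepdim=True)                 # (B, 1) predicted labels
    p_pos = pos_logits.softmax(dim=-1).gather(1, pred).squeeze(1)  # prob under positive prompt
    p_neg = neg_logits.softmax(dim=-1).gather(1, pred).squeeze(1)  # prob under negative prompt
    return p_neg - p_pos  # higher = more uncertain under this heuristic

if __name__ == "__main__":
    B, C, L, D = 128, 10, 8, 512          # pool size, classes, token length, width
    scorer = DualPromptScorer(ctx_dim=D)
    dummy_text_encoder = lambda ctx, tok: tok.mean(dim=1) + ctx.mean(dim=0)  # placeholder
    pos, neg = scorer(torch.randn(B, D), torch.randn(C, L, D), dummy_text_encoder)
    to_label = torch.topk(uncertainty_scores(pos, neg), k=16).indices  # 16 most uncertain samples
    print(to_label.shape)
```

How the negative prompt is actually optimized "in a reversed manner" is specified by the method itself; the score above only illustrates how a second, oppositely trained textual head can supply a selection signal alongside the positive classifier.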