Large pre-trained vision-language models such as CLIP have demonstrated great potential for zero-shot transfer to downstream tasks. However, attaining optimal performance requires manually selecting prompts to improve the alignment between the downstream image distribution and the textual class descriptions. This manual prompt engineering is a major obstacle to deploying such models in practice, since it requires domain expertise and is extremely time-consuming. To avoid non-trivial prompt engineering, the recent work Context Optimization (CoOp) introduced prompt learning to the vision domain using learnable textual tokens. While CoOp achieves substantial improvements over manual prompts, its learned context generalizes poorly to unseen classes within the same dataset. In this work, we present Prompt Learning with Reparameterization Encoder (PRE), a simple and efficient method that enhances the generalization ability of the learnable prompt to unseen (New) classes while maintaining the capacity to learn Base classes. Instead of directly optimizing the prompts, PRE employs a prompt encoder to reparameterize the input prompt embeddings, enhancing the exploration of task-specific knowledge from few-shot samples. Experiments and extensive ablation studies on 8 benchmarks demonstrate that our approach is an efficient method for prompt learning. Specifically, in the 16-shot setting, PRE improves over CoOp by 5.60% in average accuracy on New classes and by 3% in Harmonic mean, while keeping training time reasonable.
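The abstract reports a Harmonic mean without defining it. The sketch below assumes the usual base-to-new convention, in which the harmonic mean H of Base-class accuracy B and New-class accuracy N is H = 2BN / (B + N); that definition is an assumption here, not something the abstract states. The function name `harmonic_mean` and the sample accuracies are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def harmonic_mean(base_acc: float, new_acc: float) -> float:
    """Harmonic mean of Base-class and New-class accuracy (both in percent).

    Assumed definition: H = 2 * B * N / (B + N).
    """
    if base_acc < 0 or new_acc < 0:
        raise ValueError("accuracies must be non-negative")
    total = base_acc + new_acc
    if total == 0:
        # Both accuracies are zero; define H as 0 to avoid division by zero.
        return 0.0
    return 2.0 * base_acc * new_acc / total


# Hypothetical accuracies, for illustration only (not taken from the paper).
base, new = 80.0, 60.0
print(f"H = {harmonic_mean(base, new):.2f}")  # prints "H = 68.57"
```

Because the harmonic mean is pulled toward the smaller of the two values (68.57 here, against an arithmetic mean of 70), a method that gains on Base classes at the expense of New classes is penalized, which is why it is a natural summary of the Base/New trade-off.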