Large pre-trained vision-language models such as CLIP have shown strong zero-shot transferability to downstream tasks. To attain optimal performance, however, prompts must be selected manually to improve the alignment between the downstream image distribution and the textual class descriptions. This manual prompt engineering is the major challenge in deploying such models in practice, since it requires domain expertise and is extremely time-consuming. To avoid non-trivial prompt engineering, the recent Context Optimization (CoOp) work introduced prompt learning to the vision domain using learnable textual tokens. While CoOp achieves substantial improvements over manual prompts, its learned context generalizes poorly to wider unseen classes within the same dataset. In this work, we present Prompt Learning with Reparameterization Encoder (PRE), a simple and efficient method that enhances the generalization ability of the learnable prompt to unseen classes while maintaining the capacity to learn Base classes. Instead of directly optimizing the prompts, PRE employs a prompt encoder to reparameterize the input prompt embeddings, enhancing the exploration of task-specific knowledge from few-shot samples. Experiments and extensive ablation studies on 8 benchmarks demonstrate that our approach is an efficient method for prompt learning. Specifically, in the 16-shot setting, PRE improves on CoOp by 5.60% in average accuracy on New classes and by 3% in Harmonic mean, while keeping training time reasonable.
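The abstract reports a Harmonic mean without defining it. A minimal sketch, assuming it follows the usual base-to-new convention in prompt learning, where the harmonic mean combines Base-class and New-class accuracy; the function name and the example numbers are hypothetical and are not taken from the paper:

```python
def harmonic_mean(base_acc: float, new_acc: float) -> float:
    """Harmonic mean of Base-class and New-class accuracy.

    Assumption: H = 2 * base * new / (base + new), with both
    accuracies given on the same scale (e.g. percent).
    """
    if base_acc < 0 or new_acc < 0:
        raise ValueError("accuracies must be non-negative")
    total = base_acc + new_acc
    if total == 0:
        # Both accuracies are zero; define the mean as zero
        # rather than dividing by zero.
        return 0.0
    return 2.0 * base_acc * new_acc / total


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    print(round(harmonic_mean(80.0, 60.0), 2))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a method that gains on Base classes while losing on New classes is penalized, which is why it suits the trade-off described above.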