Prompt learning has emerged as an effective approach for enhancing the performance of Vision-Language Models (VLMs) such as CLIP on downstream tasks. However, current learnable prompt tokens are used primarily for the single phase of adapting to tasks (i.e., the adapting prompt), which readily leads to overfitting. In this work, we propose a novel Cascade Prompt Learning (CasPL) framework that enables prompt learning to serve both generic and specific expertise (i.e., boosting and adapting prompts) simultaneously. Specifically, CasPL is a new learning paradigm comprising two distinct phases of learnable prompts: the first, boosting prompt is crafted to extract domain-general knowledge from a senior, larger CLIP teacher model by aligning their predicted logits over extensive unlabeled domain images. The second, adapting prompt is then cascaded with the frozen first set and fine-tuned on downstream tasks, following the approaches employed in prior research. In this manner, CasPL effectively captures domain-general and task-specific representations in explicitly separate, progressive groups of prompts, thus potentially alleviating overfitting in the target domain. Notably, CasPL serves as a plug-and-play module that can be seamlessly integrated into any existing prompt learning approach. CasPL achieves a significantly better balance between performance and inference speed, which is especially beneficial for deploying smaller VLMs in resource-constrained environments. Compared to the previous state-of-the-art method PromptSRC, CasPL shows average improvements of 1.85% on base classes, 3.44% on novel classes, and 2.72% on the harmonic mean across 11 image classification datasets. Code is publicly available at: https://github.com/megvii-research/CasPL.
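The two-phase cascade described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt sizes, the toy `student_logits` stand-in, and all variable names are assumptions; the real method operates inside the CLIP encoders. It shows the essential structure: phase 1 trains a boosting prompt by aligning student and teacher logits on unlabeled images, then phase 2 freezes it and cascades a fresh adapting prompt that alone receives task gradients.

```python
import torch
import torch.nn.functional as F

dim, n_boost, n_adapt, n_classes = 64, 4, 4, 10
torch.manual_seed(0)
proj = torch.randn(dim, dim)  # toy stand-in for the frozen student encoder
head = torch.randn(dim, n_classes)

def student_logits(prompts, images):
    # Toy surrogate for the student CLIP forward pass: the learnable prompt
    # tokens shift the (frozen) image features before classification.
    feats = images @ proj + prompts.mean(dim=0)
    return feats @ head

# Phase 1: the boosting prompt distills domain-general knowledge from a
# larger frozen teacher CLIP via logit alignment (KL) on unlabeled images.
boost_prompt = torch.randn(n_boost, dim, requires_grad=True)
images = torch.randn(8, dim)                # unlabeled domain images (toy)
teacher_logits = torch.randn(8, n_classes)  # from the frozen teacher (toy)

kd_loss = F.kl_div(
    F.log_softmax(student_logits(boost_prompt, images), dim=-1),
    F.softmax(teacher_logits, dim=-1),
    reduction="batchmean",
)
kd_loss.backward()
assert boost_prompt.grad is not None  # boosting prompt learns in phase 1

# Phase 2: freeze the boosting prompt, cascade a new adapting prompt, and
# fine-tune only the adapting prompt on the labeled downstream task.
boost_prompt = boost_prompt.detach()  # frozen first set
adapt_prompt = torch.randn(n_adapt, dim, requires_grad=True)
cascaded = torch.cat([boost_prompt, adapt_prompt], dim=0)

labels = torch.randint(0, n_classes, (8,))
ce_loss = F.cross_entropy(student_logits(cascaded, images), labels)
ce_loss.backward()
assert boost_prompt.grad is None      # frozen: receives no gradient
assert adapt_prompt.grad is not None  # only the adapting prompt updates
```

Because the boosting prompt is fixed at inference and merely concatenated with the adapting prompt, the cascade adds no extra forward passes, consistent with the performance/speed balance claimed above.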