SEP: Self-Enhanced Prompt Tuning for Visual-Language Model

Prompt tuning based on Context Optimization (CoOp) effectively adapts visual-language models (VLMs) to downstream tasks by inferring additional learnable prompt tokens. However, these tokens are less discriminative because they are independent of the pre-trained tokens and fail to capture input-specific knowledge, such as class-aware textual or instance-aware visual knowledge. Leveraging the discriminative and generalization capabilities inherent in pre-trained tokens, we introduce a novel approach named Self-Enhanced Prompt Tuning (SEP). The core principle of SEP is to adapt the learnable prompt tokens at each encoder layer from the corresponding self-pretrained tokens, thereby explicitly incorporating discriminative prior knowledge to enhance both textual-level and visual-level embeddings. Furthermore, SEP's self-enhanced tokens not only boost discrimination but also mitigate domain shifts in unseen domains, enhancing generalization. In practice, SEP selects several representative tokens from all pre-trained tokens for each input at every layer of the text/visual encoders. Subsequently, a Token Fusion Module (TFM) is introduced to generate a self-enhanced token by merging these representative tokens with the learnable tokens using a cross-attention mechanism. This self-enhanced token is then concatenated with all pre-trained tokens, serving as input for subsequent encoder layers to produce the relevant embeddings. Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning. Code: https://github.com/htyao89/SEP.
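As a rough illustration of the per-layer pipeline described above (select representative tokens, fuse them with the learnable tokens through cross-attention, then concatenate with all pre-trained tokens), the following pure-Python sketch may help. It is not the released implementation. Several details are assumptions made only for illustration: the selection criterion (largest L2 norm), the choice of learnable tokens as queries and representative tokens as keys/values, the omission of learned projection matrices, the concatenation order, and the function names `select_representative`, `cross_attention`, and `token_fusion`.

```python
import math


def select_representative(tokens, k):
    """Pick k tokens with the largest squared L2 norm.

    ASSUMPTION: the abstract does not state the selection rule;
    this criterion is a hypothetical placeholder.
    """
    ranked = sorted(tokens, key=lambda t: sum(x * x for x in t), reverse=True)
    return ranked[:k]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections.

    ASSUMPTION: queries are the learnable tokens and keys/values are the
    representative tokens; a real module would add learned projections.
    """
    dim = len(queries[0])
    fused = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, kv)) / math.sqrt(dim)
            for kv in keys_values
        ]
        weights = softmax(scores)
        fused.append(
            [sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(dim)]
        )
    return fused


def token_fusion(learnable, pretrained, k=2):
    """Build the next layer's input: self-enhanced tokens plus all pre-trained tokens."""
    representatives = select_representative(pretrained, k)
    enhanced = cross_attention(learnable, representatives)
    # ASSUMPTION: enhanced tokens are placed before the pre-trained tokens.
    return enhanced + pretrained


# Toy example: four 2-d "pre-trained" tokens and one learnable prompt token.
pretrained_tokens = [[1.0, 0.0], [0.0, 3.0], [2.0, 2.0], [0.1, 0.1]]
learnable_tokens = [[0.5, 0.5]]
next_layer_input = token_fusion(learnable_tokens, pretrained_tokens, k=2)
print(len(next_layer_input))  # 1 self-enhanced token + 4 pre-trained tokens = 5
```

In this simplified form each self-enhanced token is a convex combination of the selected representative tokens, which is one way to see how prior knowledge from the pre-trained tokens could flow into the prompt at every layer.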
