Prompt learning is a promising approach for adapting pre-trained vision-language models (VLMs) to various downstream tasks by learning a set of text embeddings. A challenge inherent to these methods is poor generalization: the learned text embeddings are often invalid for unseen tasks. A straightforward way to bridge this gap is to freeze the text embeddings in prompts, but this leaves insufficient capacity to adapt VLMs to downstream tasks. To address this dilemma, we propose a paradigm called EnPrompt with a novel External Layer (EnLa). Specifically, we propose a textual external layer and learnable visual embeddings for adapting VLMs to downstream tasks. The learnable external layer is built upon the valid embeddings of pre-trained CLIP; this design balances the learning capacities of the two branches. To align the textual and visual features, we propose a two-pronged approach: i) we introduce optimal transport as the discrepancy metric to align the vision and text modalities, and ii) we introduce a novel strengthening feature to enhance the interaction between the two modalities. Four representative experiments (i.e., base-to-novel generalization, few-shot learning, cross-dataset generalization, and domain-shift generalization) across 15 datasets demonstrate that our method outperforms existing prompt learning methods.
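To make the two core ideas concrete, below is a minimal PyTorch sketch under our own assumptions: the names `ExternalLayer`, `sinkhorn`, and `ot_alignment_loss`, along with all hyperparameters (hidden width, entropic regularization `eps`, iteration count), are illustrative and not the authors' implementation. The sketch shows (a) a small learnable layer applied residually on top of frozen CLIP text embeddings, and (b) an entropic-regularized optimal-transport discrepancy between visual and text features, one generic way to realize the alignment the abstract describes.

```python
# Hypothetical sketch of EnLa-style text adaptation and OT alignment.
# All module names and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExternalLayer(nn.Module):
    """Learnable external layer applied on top of frozen CLIP text embeddings."""

    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, frozen_text_emb: torch.Tensor) -> torch.Tensor:
        # Residual update keeps outputs anchored to the valid CLIP embeddings
        # while still adding task-specific capacity.
        return frozen_text_emb + self.net(frozen_text_emb)


def sinkhorn(cost: torch.Tensor, eps: float = 0.1, iters: int = 50) -> torch.Tensor:
    """Entropic-regularized OT plan between uniform marginals (Sinkhorn)."""
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n, device=cost.device)
    nu = torch.full((m,), 1.0 / m, device=cost.device)
    K = torch.exp(-cost / eps)  # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(iters):  # alternating marginal-scaling updates
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    return u.unsqueeze(1) * K * v.unsqueeze(0)  # transport plan, shape (n, m)


def ot_alignment_loss(visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
    """OT discrepancy between L2-normalized visual tokens and text embeddings."""
    visual = F.normalize(visual, dim=-1)  # (n, d) visual features
    text = F.normalize(text, dim=-1)      # (m, d) adapted text features
    cost = 1.0 - visual @ text.t()        # cosine-distance cost matrix
    plan = sinkhorn(cost)
    return (plan * cost).sum()            # scalar transport cost
```

Usage would mirror standard prompt tuning: pass frozen CLIP text embeddings through `ExternalLayer`, then minimize `ot_alignment_loss` between the adapted text features and the image features alongside the task loss. The residual form in the sketch reflects the abstract's stated balance between keeping CLIP's valid embeddings and adding learnable capacity.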