Large-scale vision-language models (VLMs) such as CLIP learn broad visual concepts from massive training data and generalize remarkably well. A number of prompt learning methods have been proposed to adapt VLMs efficiently to downstream tasks with only a few training samples. We introduce Dual-Aligned Prompt Tuning (DuAl-PT), a novel method that improves prompt learning for VLMs by incorporating pre-trained large language models (LLMs). Learnable prompts, as in CoOp, model context implicitly through end-to-end training, which makes them difficult to control and interpret. Explicit context descriptions generated by LLMs such as GPT-3 can be used directly for zero-shot classification, but such prompts rely heavily on the LLM and remain underexplored in few-shot settings. With DuAl-PT, we propose to learn more context-aware prompts that benefit from both explicit and implicit context modeling. To achieve this, we use a pre-trained LLM to generate context descriptions and encourage the prompts to learn from the LLM's knowledge through alignment; we additionally align the prompts with local image features. Empirically, DuAl-PT achieves superior performance on 11 downstream datasets in few-shot recognition and base-to-new generalization. We hope DuAl-PT can serve as a strong baseline. Code will be made available.
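To make the two alignment terms concrete, the following is a minimal sketch under stated assumptions, not the DuAl-PT implementation. The abstract does not give the loss form, so the cosine-based terms, the max over patches, the weights `w_text` and `w_local`, and all function names here are hypothetical choices made purely for illustration. Features are plain Python lists standing in for encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def text_alignment_loss(prompt_feat, description_feat):
    """Assumed explicit-context term: pull the learnable prompt's text
    feature toward the feature of an LLM-generated description."""
    return 1.0 - cosine(prompt_feat, description_feat)


def local_alignment_loss(prompt_feat, patch_feats):
    """Assumed local term: reward the best-matching local image feature.
    Taking the max over patches is an illustrative choice."""
    best = max(cosine(prompt_feat, patch) for patch in patch_feats)
    return 1.0 - best


def dual_alignment_loss(prompt_feat, description_feat, patch_feats,
                        w_text=1.0, w_local=1.0):
    """Weighted sum of the two hypothetical alignment terms."""
    return (w_text * text_alignment_loss(prompt_feat, description_feat)
            + w_local * local_alignment_loss(prompt_feat, patch_feats))


if __name__ == "__main__":
    # Toy 2-D features; real features would come from the VLM encoders.
    prompt = [1.0, 0.0]
    description = [0.8, 0.6]
    patches = [[0.0, 1.0], [0.6, 0.8]]
    print(dual_alignment_loss(prompt, description, patches))
```

In a real few-shot setup these terms would presumably be added to the usual classification loss and minimized with respect to the learnable prompt vectors only; that combination is likewise an assumption rather than something the abstract states.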