Learning to Prompt with Text Only Supervision for Vision-Language Models

Foundational vision-language models such as CLIP are becoming a new paradigm in vision due to their excellent generalization abilities. However, adapting these models to downstream tasks while maintaining their generalization remains a challenge. In the literature, one branch of methods adapts CLIP by learning prompts using visual information. While effective, most of these works require labeled data, which is not practical, and they often struggle to generalize to new datasets due to over-fitting on the source data. An alternative approach resorts to training-free methods that generate class descriptions from large language models (LLMs) and perform prompt ensembling. However, these methods often generate class-specific prompts that cannot be transferred to other classes, and they incur higher costs because LLM descriptions must be generated for each class separately. In this work, we propose to combine the strengths of both streams of methods by learning prompts using only text data derived from LLMs. As supervised training of prompts is not trivial in the absence of images, we develop a training approach that allows prompts to extract rich contextual knowledge from LLM data. Moreover, because the LLM contextual data is mapped within the learned prompts, our approach enables zero-shot transfer of prompts to new classes and datasets, potentially cutting the LLM prompt engineering cost. To the best of our knowledge, this is the first work that learns generalized prompts using text-only data. We perform extensive evaluations on 4 benchmarks, where our method improves over prior ensembling works while being competitive with those utilizing labeled images. Our code and pre-trained models are available at https://github.com/muzairkhattak/ProText.
