Contrastively pretrained large Vision-Language Models (VLMs) like CLIP have revolutionized visual representation learning by delivering strong performance on downstream datasets. VLMs are adapted to a downstream dataset in a zero-shot manner by designing prompts that are relevant to that dataset. Such prompt engineering relies on domain expertise and a validation dataset. Meanwhile, recent developments in generative pretrained models like GPT-4 mean that they can be used as advanced internet search tools and can be prompted to provide visual information in any desired structure. In this work, we show that GPT-4 can be used to generate visually descriptive text and that this text can be used to adapt CLIP to downstream tasks. Compared to CLIP's default prompt, we show considerable improvements in zero-shot transfer accuracy on specialized fine-grained datasets like EuroSAT (~7%), DTD (~7%), SUN397 (~4.6%), and CUB (~3.3%). We also design a simple few-shot adapter that learns to choose the best possible sentences to construct generalizable classifiers, which outperform the recently proposed CoCoOp by ~2% on average and by over 4% on four specialized fine-grained datasets. The code, prompts, and auxiliary text dataset are available at https://github.com/mayug/VDT-Adapter.
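To make the zero-shot setting concrete, the following is a minimal sketch of how several descriptive sentences per class could be combined into one classifier. It is not the paper's implementation: it assumes the sentence and image embeddings have already been produced by a CLIP-style encoder, it assumes the sentences of a class are combined by simple averaging (the abstract does not state the aggregation rule), and the helper names, class names, and 3-dimensional toy vectors are all hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def class_prototype(sentence_embeddings):
    """Average the unit-normalized sentence embeddings of one class, then renormalize.

    Averaging is an assumption made for this sketch, not a detail taken from the abstract.
    """
    unit = [normalize(e) for e in sentence_embeddings]
    dim = len(unit[0])
    mean = [sum(e[i] for e in unit) / len(unit) for i in range(dim)]
    return normalize(mean)


def zero_shot_predict(image_embedding, class_to_sentence_embeddings):
    """Return the class whose prototype is most cosine-similar to the image, plus all scores."""
    image = normalize(image_embedding)
    scores = {
        name: sum(a * b for a, b in zip(image, class_prototype(embs)))
        for name, embs in class_to_sentence_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Toy 3-d vectors standing in for real text and image embeddings (made up for illustration).
toy_text = {
    "forest": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "river": [[0.1, 0.9, 0.0], [0.0, 0.8, 0.3]],
}
toy_image = [0.7, 0.2, 0.1]
label, scores = zero_shot_predict(toy_image, toy_text)
```

In a real pipeline the toy vectors would be replaced by encoder outputs for the generated sentences and the test image. A few-shot adapter of the kind described above would presumably replace the uniform average with learned per-sentence weights, though the abstract does not specify that mechanism.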