Contrastively pretrained large Vision-Language Models (VLMs) like CLIP have revolutionized visual representation learning by delivering strong performance on downstream datasets. VLMs are adapted to a downstream dataset in a zero-shot manner by designing prompts that are relevant to that dataset. Such prompt engineering relies on domain expertise and a validation dataset. Meanwhile, recent developments in generative pretrained models like GPT-4 mean they can be used as advanced internet search tools. They can also be prompted to provide visual information in any structure. In this work, we show that GPT-4 can be used to generate visually descriptive text, and we show how this text can be used to adapt CLIP to downstream tasks. Compared to CLIP's default prompt, we obtain considerable improvements in zero-shot transfer accuracy on specialized fine-grained datasets like EuroSAT (~7%), DTD (~7%), SUN397 (~4.6%), and CUB (~3.3%). We also design a simple few-shot adapter that learns to choose the best possible sentences to construct generalizable classifiers; these outperform the recently proposed CoCoOp by ~2% on average and by over 4% on 4 specialized fine-grained datasets. We will release the code, prompts, and auxiliary text dataset upon acceptance.
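To make the zero-shot idea concrete, the following is a minimal sketch of how several descriptive sentences per class could be turned into a classifier. It is not the authors' implementation: the averaging of sentence embeddings into one prototype per class is an assumption borrowed from common prompt-ensembling practice, the helper names (`normalize`, `class_prototype`, `zero_shot_predict`) are hypothetical, and the 3-dimensional vectors are toy placeholders standing in for CLIP text and image embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def class_prototype(sentence_embeddings):
    """Average the unit-normalized sentence embeddings of one class, then renormalize.

    Assumption: simple averaging; the abstract does not say how sentences are combined.
    """
    unit = [normalize(e) for e in sentence_embeddings]
    dim = len(unit[0])
    mean = [sum(e[i] for e in unit) / len(unit) for i in range(dim)]
    return normalize(mean)


def zero_shot_predict(image_embedding, prototypes):
    """Return the class whose prototype has the highest cosine similarity to the image."""
    img = normalize(image_embedding)
    scores = {
        name: sum(a * b for a, b in zip(img, proto))
        for name, proto in prototypes.items()
    }
    return max(scores, key=scores.get), scores


# Toy placeholder embeddings (hypothetical): two descriptive sentences per class.
toy_text_embeddings = {
    "forest": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "river": [[0.1, 0.9, 0.1], [0.0, 0.8, 0.3]],
}
prototypes = {
    name: class_prototype(embs) for name, embs in toy_text_embeddings.items()
}

# Toy placeholder for an image embedding.
label, scores = zero_shot_predict([0.7, 0.2, 0.1], prototypes)
print(label)
```

In a real pipeline the toy vectors would be replaced by embeddings from CLIP's text and image encoders, and a few-shot adapter of the kind described above could replace the uniform average with learned per-sentence weights.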