Open-vocabulary models are a promising new paradigm for image classification. Unlike traditional classification models, open-vocabulary models can classify among an arbitrary set of categories specified in natural language at inference time. These natural-language inputs, called "prompts", typically consist of a set of hand-written templates (e.g., "a photo of a {}") that are completed with each of the category names. This work introduces a simple method to generate higher-accuracy prompts without relying on any explicit knowledge of the task domain and with far fewer hand-constructed sentences. To achieve this, we combine open-vocabulary models with large language models (LLMs) to create Customized Prompts via Language models (CuPL, pronounced "couple"). In particular, we leverage the knowledge contained in LLMs to generate many descriptive sentences that contain important discriminating characteristics of the image categories. This allows the model to place greater importance on the corresponding image regions when making predictions. We find that this straightforward and general approach improves accuracy on a range of zero-shot image classification benchmarks, including a gain of over one percentage point on ImageNet. Finally, this simple baseline requires no additional training and remains completely zero-shot. Code is available at https://github.com/sarahpratt/CuPL.
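The classification step described above can be illustrated with a minimal sketch. It is not the released CuPL code, and every name in it is hypothetical. A bag-of-words `toy_embed` function stands in for the text and image encoders of a real open-vocabulary model, the descriptive sentences are hand-written placeholders rather than actual LLM output, and the "image" is represented by a caption. The sketch also assumes that the prompts of one category are combined by averaging their normalized embeddings, a common ensembling choice that the abstract itself does not specify.

```python
import numpy as np

# Placeholder sentences standing in for LLM-generated descriptions (assumption:
# a real pipeline would obtain these by querying an LLM for each category).
prompts_by_class = {
    "platypus": [
        "a platypus has a flat bill and webbed feet",
        "a platypus looks like a beaver with a duck bill",
    ],
    "goldfish": [
        "a goldfish is a small orange fish with shiny scales",
        "a goldfish has flowing fins and a forked tail",
    ],
}


def build_vocab(prompts_by_class):
    """Map every word that occurs in any prompt to a vector index."""
    words = sorted(
        {w for ps in prompts_by_class.values() for p in ps for w in p.lower().split()}
    )
    return {w: i for i, w in enumerate(words)}


def toy_embed(text, vocab):
    """Hypothetical stand-in for a real encoder: L2-normalized bag of words."""
    vec = np.zeros(len(vocab))
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def class_embedding(prompts, vocab):
    """Average the prompt embeddings of one category, then renormalize."""
    mean = np.mean([toy_embed(p, vocab) for p in prompts], axis=0)
    return mean / np.linalg.norm(mean)


def classify(query_embedding, prompts_by_class, vocab):
    """Return the category whose averaged prompt embedding is most similar."""
    names = list(prompts_by_class)
    weights = np.stack([class_embedding(prompts_by_class[n], vocab) for n in names])
    scores = weights @ query_embedding  # cosine similarities (unit vectors)
    return names[int(np.argmax(scores))], dict(zip(names, scores.tolist()))


vocab = build_vocab(prompts_by_class)
# A caption plays the role of the image embedding in this toy setting.
image_embedding = toy_embed("furry animal with webbed feet and a duck bill", vocab)
prediction, scores = classify(image_embedding, prompts_by_class, vocab)
print(prediction, scores)
```

In a real setting, `toy_embed` would be replaced by the text and image encoders of an open-vocabulary model, and the placeholder sentences by LLM generations. Because only the prompts change, no model weights are updated, which is consistent with the zero-shot claim above.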