Prompt ensembling of category-specific prompts generated by Large Language Models (LLMs) has emerged as an effective way to enhance the zero-shot recognition ability of Vision-Language Models (VLMs). To obtain these category-specific prompts, current methods rely on hand-crafted prompts to the LLMs, which in turn generate the VLM prompts for the downstream tasks. However, this requires manually composing the task-specific LLM prompts, and even then they might not cover the diverse set of visual concepts and task-specific styles associated with the categories of interest. To take humans out of the loop and fully automate prompt generation for zero-shot recognition, we propose Meta-Prompting for Visual Recognition (MPVR). Taking as input only minimal information about the target task, in the form of a short natural language description and a list of associated class labels, MPVR automatically produces a diverse set of category-specific prompts, resulting in a strong zero-shot classifier. MPVR generalizes effectively across popular zero-shot image recognition benchmarks from widely different domains when tested with multiple LLMs and VLMs. For example, MPVR improves zero-shot recognition over CLIP by up to 19.8% and 18.2% (5.0% and 4.5% on average over 20 datasets) when leveraging GPT and Mixtral LLMs, respectively.
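The prompt ensembling step mentioned above can be illustrated with a minimal sketch. The abstract does not specify how the ensemble is formed, so this assumes the common practice of averaging the L2-normalized text embeddings of all prompts for a class and classifying an image by cosine similarity. The function names and the three-dimensional toy vectors are hypothetical stand-ins for real VLM text and image embeddings; this is not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def ensemble_class_embeddings(prompt_embeddings):
    """Average the normalized prompt embeddings of each class.

    prompt_embeddings maps a class label to a list of text embeddings,
    one per category-specific prompt (assumed to come from a VLM text encoder).
    """
    class_embeddings = {}
    for label, embeddings in prompt_embeddings.items():
        normalized = [l2_normalize(e) for e in embeddings]
        dim = len(normalized[0])
        mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
        # Re-normalize so the dot product below is a cosine similarity.
        class_embeddings[label] = l2_normalize(mean)
    return class_embeddings


def classify(image_embedding, class_embeddings):
    """Return the label whose ensembled embedding is most similar to the image."""
    image = l2_normalize(image_embedding)
    scores = {
        label: sum(a * b for a, b in zip(image, emb))
        for label, emb in class_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Toy embeddings (hypothetical values, two prompts per class).
toy_prompt_embeddings = {
    "cat": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "dog": [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]],
}
class_embeddings = ensemble_class_embeddings(toy_prompt_embeddings)
predicted, scores = classify([0.7, 0.3, 0.05], class_embeddings)
print(predicted)
```

In a real pipeline the toy vectors would be replaced by the text-encoder outputs for the LLM-generated prompts and the image-encoder output for the query image. The averaging and similarity steps would stay the same under the stated assumption.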