Vision-language models (VLMs) have made significant progress in image classification by training on large-scale paired image-text data, but their performance depends heavily on prompt quality. Recent methods show that visual descriptions generated by large language models (LLMs) improve the generalization of VLMs; however, class-specific prompts may be inaccurate or insufficiently discriminative because LLMs hallucinate. In this paper, we aim to find visually discriminative prompts for fine-grained categories with minimal supervision and no human in the loop. We propose an evolution-based algorithm that progressively optimizes language prompts from task-specific templates to class-specific descriptions. Unlike template optimization, the search space of class-specific candidate prompts grows explosively, which increases prompt generation cost, the number of iterations, and the risk of overfitting. To address this, we first introduce several simple yet effective edit-based and evolution-based operations that generate diverse candidate prompts from a single LLM query. We then propose two sampling strategies that find a better initial search point and reduce the number of categories traversed, saving iteration cost. Moreover, we apply a novel fitness score with entropy constraints to mitigate overfitting. In a challenging one-shot image classification setting, our method outperforms existing textual prompt-based methods and improves LLM-generated description methods across 13 datasets. We also demonstrate that our optimal prompts improve adapter-based methods and transfer effectively across different backbones.
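As a rough illustration of the search loop described above, the following minimal sketch evolves a prompt by random edits and ranks candidates with an entropy-regularized fitness. The abstract does not give the formula or the exact operations, so everything here is an assumption made for illustration: the function names (`fitness`, `edit`, `evolve`), the add/remove/replace edit set, the form "accuracy minus `lam` times mean predictive entropy", and the representation of a prompt as a tuple of phrases drawn from a fixed `phrase_pool` (standing in for descriptions obtained from a single LLM query). In practice the logits would come from a VLM scoring images against the candidate prompts.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def fitness(logits_per_image, labels, lam=0.1):
    """Assumed form: accuracy minus lam times the mean predictive entropy."""
    correct = 0
    total_entropy = 0.0
    for logits, label in zip(logits_per_image, labels):
        probs = softmax(logits)
        predicted = max(range(len(probs)), key=probs.__getitem__)
        correct += int(predicted == label)
        total_entropy += entropy(probs)
    n = len(labels)
    return correct / n - lam * total_entropy / n


def edit(prompt, phrase_pool, rng):
    """Apply one random edit (add, remove, or replace a phrase) to a prompt.

    A prompt is a tuple of phrases; phrase_pool is a hypothetical stand-in
    for candidate descriptions generated once by an LLM.
    """
    phrases = list(prompt)
    op = rng.choice(["add", "remove", "replace"])
    if op == "add" or not phrases:
        phrases.append(rng.choice(phrase_pool))
    elif op == "remove":
        phrases.pop(rng.randrange(len(phrases)))
    else:
        phrases[rng.randrange(len(phrases))] = rng.choice(phrase_pool)
    return tuple(phrases)


def evolve(initial, phrase_pool, score, generations=20, children=8, keep=2, seed=0):
    """Elitist evolutionary search: survivors are never worse than their parents."""
    rng = random.Random(seed)
    pool = [initial]
    for _ in range(generations):
        offspring = [edit(rng.choice(pool), phrase_pool, rng) for _ in range(children)]
        pool = sorted(pool + offspring, key=score, reverse=True)[:keep]
    return pool[0]
```

A toy call might look like `evolve((), ["red crown", "long tail", "a photo"], score)`, where `score` is any function mapping a prompt to a number. Because each generation keeps the best candidates from the parents and offspring combined, the returned prompt never scores below the initial one.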