ProAPO: Progressively Automatic Prompt Optimization for Visual Classification

Vision-language models (VLMs) have made significant progress in image classification by training on large-scale paired image-text data. Their performance depends largely on prompt quality. While recent methods show that visual descriptions generated by large language models (LLMs) enhance the generalization of VLMs, class-specific prompts may be inaccurate or insufficiently discriminative because of hallucination in LLMs. In this paper, we aim to find visually discriminative prompts for fine-grained categories with minimal supervision and no human in the loop. We propose an evolution-based algorithm that progressively optimizes language prompts from task-specific templates to class-specific descriptions. Unlike template optimization, the search space of class-specific candidate prompts grows explosively, which raises prompt-generation cost, the number of iterations, and the risk of overfitting. To address this, we first introduce several simple yet effective edit-based and evolution-based operations that generate diverse candidate prompts from a single LLM query. We then propose two sampling strategies that find a better initial search point and reduce the number of categories traversed, saving iteration cost. Moreover, we apply a novel fitness score with entropy constraints to mitigate overfitting. In a challenging one-shot image classification setting, our method outperforms existing textual prompt-based methods and improves LLM-generated description methods across 13 datasets. We also demonstrate that our optimal prompts improve adapter-based methods and transfer effectively across different backbones.
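The search loop outlined above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`fitness`, `edit_candidates`, `crossover`, `evolve`), the representation of a prompt as a tuple of description strings drawn from a fixed pool, and the specific fitness form (accuracy plus a weighted mean log-probability of the true class, standing in for the entropy-constrained score) are all assumptions made for illustration. In practice, `score_fn` would wrap a VLM evaluated on the few labeled images; here it is left as a user-supplied callable.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fitness(logits_per_sample, labels, alpha=0.1):
    """Assumed fitness: accuracy + alpha * mean log-prob of the true class.

    The second term is a negative cross-entropy; it is a stand-in for the
    paper's entropy constraint, whose exact form the abstract does not give.
    """
    correct, log_lik = 0, 0.0
    for logits, y in zip(logits_per_sample, labels):
        probs = softmax(logits)
        pred = max(range(len(probs)), key=probs.__getitem__)
        correct += int(pred == y)
        log_lik += math.log(probs[y])
    n = len(labels)
    return correct / n + alpha * (log_lik / n)


def edit_candidates(prompt, pool, rng):
    """Edit-based operations: add, delete, or replace one description."""
    out = []
    unused = [d for d in pool if d not in prompt]
    if unused:  # add
        out.append(prompt + (rng.choice(unused),))
    if len(prompt) > 1:  # delete (never produces an empty prompt)
        i = rng.randrange(len(prompt))
        out.append(prompt[:i] + prompt[i + 1:])
    if unused:  # replace
        i = rng.randrange(len(prompt))
        out.append(prompt[:i] + (rng.choice(unused),) + prompt[i + 1:])
    return out


def crossover(a, b):
    """Evolution-based operation: order-preserving union of two prompts."""
    return tuple(dict.fromkeys(a + b))


def evolve(init, pool, score_fn, iterations=20, keep=4, seed=0):
    """Keep the top-`keep` prompts each round; parents always compete too."""
    rng = random.Random(seed)
    population = [init]
    for _ in range(iterations):
        candidates = list(population)
        for p in population:
            candidates.extend(edit_candidates(p, pool, rng))
        if len(population) >= 2:
            a, b = rng.sample(population, 2)
            candidates.append(crossover(a, b))
        unique = list(dict.fromkeys(candidates))  # drop duplicate prompts
        unique.sort(key=score_fn, reverse=True)
        population = unique[:keep]
    return population[0]
```

Because each round's parents are re-scored alongside their offspring, the best score in this sketch never decreases from one iteration to the next. The description pool is fixed up front, which mirrors the idea of generating candidates from a single LLM query rather than querying the LLM at every step.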
