Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities on downstream tasks when fine-tuned with minimal data. However, many VLMs rely on proprietary data and are not open-source, which restricts the use of white-box approaches for fine-tuning. We therefore aim to develop a black-box approach that optimizes VLMs through natural language prompts, avoiding the need to access model parameters, feature embeddings, or even output logits. We propose employing chat-based LLMs to search for the best text prompt for VLMs. Specifically, we adopt an automatic hill-climbing procedure that converges to an effective prompt by evaluating the performance of the current prompts and asking the LLM to refine them based on textual feedback, all within a conversational process without a human in the loop. In a challenging 1-shot image classification setup, our simple approach surpasses the white-box continuous prompting method (CoOp) by an average of 1.5% across 11 datasets including ImageNet. Our approach also outperforms both human-engineered and LLM-generated prompts. We highlight the advantage of conversational feedback that incorporates both positive and negative prompts, suggesting that LLMs can utilize the implicit gradient direction in textual feedback for a more efficient search. In addition, we find that the text prompts generated through our strategy are not only more interpretable but also transfer well across different VLM architectures in a black-box manner. Lastly, we apply our framework to optimize a state-of-the-art black-box VLM (DALL-E 3) for text-to-image generation, prompt inversion, and personalization.
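The hill-climbing procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_score` stands in for measuring a prompt's 1-shot accuracy through a black-box VLM, `toy_propose` stands in for the chat-based LLM that reads positive and negative prompts and suggests refinements, and the vocabulary, scoring rule, and parameter names (`n_iters`, `k`) are all assumptions made for illustration. The one property it is meant to show is that the search needs only a scalar score per prompt, with no parameters, embeddings, or logits.

```python
from typing import Callable, Dict, List, Tuple


def hill_climb(
    score_fn: Callable[[str], float],
    propose_fn: Callable[[List[str], List[str]], List[str]],
    init_prompts: List[str],
    n_iters: int = 5,
    k: int = 1,
) -> Tuple[str, float]:
    """Greedy prompt search driven only by scalar scores (black-box)."""
    scored: Dict[str, float] = {p: score_fn(p) for p in init_prompts}
    for _ in range(n_iters):
        # Rank all prompts seen so far, best first.
        ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
        positives = [p for p, _ in ranked[:k]]   # best prompts so far
        negatives = [p for p, _ in ranked[-k:]]  # worst prompts so far
        # Ask the (stand-in) LLM for refinements given both kinds of feedback.
        for candidate in propose_fn(positives, negatives):
            if candidate not in scored:
                scored[candidate] = score_fn(candidate)
    best = max(scored, key=scored.get)
    return best, scored[best]


# --- Hypothetical stand-ins, for illustration only ---
USEFUL = {"photo", "pet", "close-up"}
VOCAB = ["photo", "pet", "close-up", "sketch", "image", "blurry"]


def toy_score(prompt: str) -> float:
    """Stand-in for VLM accuracy: reward useful words, penalize the rest."""
    words = set(prompt.split())
    return len(words & USEFUL) - 0.5 * len(words - USEFUL)


def toy_propose(positives: List[str], negatives: List[str]) -> List[str]:
    """Stand-in for the chat LLM: extend the best prompt by one word,
    avoiding words that appear only in the negative prompts."""
    base = positives[0]
    pos_words = set(" ".join(positives).split())
    neg_words = set(" ".join(negatives).split())
    banned = neg_words - pos_words
    return [
        f"{base} {w}"
        for w in VOCAB
        if w not in base.split() and w not in banned
    ]


best_prompt, best_score = hill_climb(toy_score, toy_propose, ["image", "sketch"])
print(best_prompt, best_score)
```

In this toy setting the negative prompts steer the proposer away from words that hurt the score, a crude analogue of the implicit gradient direction the abstract attributes to conversational feedback. In a real setup `score_fn` would query the VLM on the few-shot training set and `propose_fn` would call a chat LLM.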