Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities across a variety of vision and multimodal tasks. Current fine-tuning methods for VLMs mainly operate in a white-box setting, requiring access to model parameters for backpropagation. However, many VLMs rely on proprietary data and are not open-source, which restricts the use of white-box approaches for fine-tuning. Given that popular private large language models (LLMs) like ChatGPT still offer a language-based user interface, we aim to develop a novel fine-tuning approach for VLMs through natural language prompts, thereby avoiding the need to access model parameters, feature embeddings, or output logits. In this setup, we propose employing chat-based LLMs as black-box optimizers to search for the best text prompt on the illustrative task of few-shot image classification using CLIP. Specifically, we adopt an automatic "hill-climbing" procedure that converges on an effective prompt by evaluating the accuracy of current prompts and asking LLMs to refine them based on textual feedback, all within a conversational process without a human in the loop. In a challenging 1-shot learning setup, our simple approach surpasses the white-box continuous prompting method CoOp by an average of 1.5% across 11 datasets including ImageNet. Our approach also outperforms OpenAI's manually crafted prompts and is more efficient than other black-box methods like iterative APE. Additionally, we highlight the advantage of conversational feedback that incorporates both positive and negative prompts, suggesting that LLMs can utilize the implicit "gradient" direction in textual feedback for a more efficient search. Lastly, we find that the text prompts generated through our strategy are not only more interpretable but also transfer well across different CLIP architectures in a black-box manner.
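To make the "hill-climbing" loop described above concrete, the following is a minimal sketch, not the authors' implementation. It assumes two hypothetical callables: `score_fn`, which stands in for measuring the few-shot accuracy of a prompt with CLIP, and `propose_fn`, which stands in for a chat-based LLM that receives the best (positive) and worst (negative) prompts so far and returns new candidates. Both are replaced here by toy functions so the example runs offline; the names `hill_climb`, `toy_score`, and `make_toy_llm` are illustrative only.

```python
import random
from typing import Callable, Dict, List, Tuple


def hill_climb(
    init_prompts: List[str],
    score_fn: Callable[[str], float],
    propose_fn: Callable[[List[str], List[str]], List[str]],
    n_iters: int = 20,
    k: int = 2,
) -> Tuple[str, float, List[float]]:
    """Black-box prompt search: only prompt text and scalar scores are used."""
    scores: Dict[str, float] = {p: score_fn(p) for p in init_prompts}
    history: List[float] = []
    for _ in range(n_iters):
        ranked = sorted(scores, key=lambda p: scores[p], reverse=True)
        positives, negatives = ranked[:k], ranked[-k:]
        # Textual feedback: show the proposer both good and bad prompts.
        for cand in propose_fn(positives, negatives):
            if cand not in scores:
                scores[cand] = score_fn(cand)
        history.append(max(scores.values()))
    best = max(scores, key=lambda p: scores[p])
    return best, scores[best], history


# Hypothetical stand-in for CLIP few-shot accuracy: fraction of "useful" words.
_USEFUL = {"photo", "close-up", "flower", "petals"}


def toy_score(prompt: str) -> float:
    return len(set(prompt.split()) & _USEFUL) / len(_USEFUL)


def make_toy_llm(seed: int = 0) -> Callable[[List[str], List[str]], List[str]]:
    """Hypothetical stand-in for a chat LLM; a real one would be queried in text."""
    rng = random.Random(seed)
    vocab = ["photo", "close-up", "flower", "petals", "blurry", "sketch", "car"]

    def propose(positives: List[str], negatives: List[str]) -> List[str]:
        pos_words = {w for p in positives for w in p.split()}
        neg_only = {w for p in negatives for w in p.split()} - pos_words
        # Avoid words seen only in low-scoring prompts (the negative feedback).
        choices = [w for w in vocab if w not in neg_only] or vocab
        out = []
        for p in positives:
            words = p.split()
            new_word = rng.choice(choices)
            if new_word not in words:
                words.append(new_word)
            out.append(" ".join(words))
        return out

    return propose


best, best_score, history = hill_climb(
    ["a photo", "a blurry sketch", "a car"], toy_score, make_toy_llm(seed=0)
)
print(best, best_score)
```

Because every scored prompt is retained, the best score recorded in `history` can never decrease from one iteration to the next. In a real setting, `score_fn` would evaluate a prompt on the few-shot training images and `propose_fn` would wrap a conversation with the LLM; neither interface is specified by the abstract, so both signatures are assumptions.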