Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities across a variety of vision and multimodal tasks. Currently, fine-tuning methods for VLMs mainly operate in a white-box setting, requiring access to model parameters for backpropagation. However, many VLMs rely on proprietary data and are not open-source, which restricts the use of white-box approaches for fine-tuning. Given that popular private large language models (LLMs) like ChatGPT still offer a language-based user interface, we aim to develop a novel fine-tuning approach for VLMs through natural language prompts, thereby avoiding the need to access model parameters, feature embeddings, or output logits. In this setup, we propose employing chat-based LLMs as black-box optimizers to search for the best text prompt on the illustrative task of few-shot image classification using CLIP. Specifically, we adopt an automatic "hill-climbing" procedure that converges on an effective prompt by evaluating the accuracy of the current prompts and asking the LLM to refine them based on textual feedback, all within a conversational process without a human in the loop. In a challenging 1-shot learning setup, our simple approach surpasses the white-box continuous prompting method (CoOp) by an average of 1.5% across 11 datasets including ImageNet. Our approach also outperforms OpenAI's manually crafted prompts. Additionally, we highlight the advantage of conversational feedback that incorporates both positive and negative prompts, suggesting that LLMs can utilize the implicit "gradient" direction in textual feedback for a more efficient search. Lastly, we find that the text prompts generated through our strategy are not only more interpretable but also transfer well across different CLIP architectures in a black-box manner.
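The hill-climbing procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: `hill_climb_prompts`, `toy_score`, and `make_toy_proposer` are hypothetical names introduced here. In the setting of the abstract, the scoring function would be few-shot classification accuracy with CLIP and the proposer would be a chat-based LLM that receives the best and worst prompts as textual feedback; both are replaced by toy stand-ins so that the loop runs without any model or API access.

```python
import random
from typing import Callable, Dict, List, Tuple

Scored = List[Tuple[str, float]]


def hill_climb_prompts(
    initial_prompts: List[str],
    score_fn: Callable[[str], float],
    propose_fn: Callable[[Scored, Scored], List[str]],
    n_iters: int = 10,
    k: int = 2,
) -> Tuple[str, float]:
    """Keep a pool of scored prompts and repeatedly ask the proposer for
    new candidates, given the top-k (positive) and bottom-k (negative) ones."""
    scored: Dict[str, float] = {p: score_fn(p) for p in initial_prompts}
    for _ in range(n_iters):
        ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
        positives, negatives = ranked[:k], ranked[-k:]
        for candidate in propose_fn(positives, negatives):
            if candidate not in scored:  # score each new prompt only once
                scored[candidate] = score_fn(candidate)
    # Earlier prompts are never discarded, so the result cannot be worse
    # than the best initial prompt.
    return max(scored.items(), key=lambda kv: kv[1])


# --- Toy stand-ins (assumptions for illustration only) ---------------------
VOCAB = ["a", "photo", "of", "blurry", "sketch", "the"]
TARGET = {"a", "photo", "of"}


def toy_score(prompt: str) -> float:
    """Stand-in for few-shot accuracy: rewards target words, penalizes others."""
    words = set(prompt.replace("{}", "").split())
    hits = len(words & TARGET)
    penalty = len(words - TARGET)
    return max(0.0, (hits - 0.5 * penalty) / len(TARGET))


def make_toy_proposer(seed: int = 0) -> Callable[[Scored, Scored], List[str]]:
    """Stand-in for the chat LLM: extends each positive prompt with a word,
    avoiding words that occur only in the negative prompts."""
    rng = random.Random(seed)

    def propose(positives: Scored, negatives: Scored) -> List[str]:
        pos_words = {w for p, _ in positives for w in p.split()}
        neg_words = {w for p, _ in negatives for w in p.split()}
        avoid = neg_words - pos_words - {"{}"}
        choices = [w for w in VOCAB if w not in avoid] or VOCAB
        candidates = []
        for prompt, _ in positives:
            stem = prompt.replace("{}", "").split()
            word = rng.choice(choices)
            if word not in stem:
                stem.append(word)
            candidates.append(" ".join(stem + ["{}"]))
        return candidates

    return propose


if __name__ == "__main__":
    start = ["{}", "blurry {}", "sketch {}"]
    best, score = hill_climb_prompts(start, toy_score, make_toy_proposer(0))
    print(best, score)
```

Passing the bottom-ranked prompts alongside the top-ranked ones mirrors the positive and negative feedback mentioned in the abstract: the proposer can move toward what worked and away from what did not, without ever seeing parameters, embeddings, or logits.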