Language Models as Black-Box Optimizers for Vision-Language Models

Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities on downstream tasks when fine-tuned with minimal data. However, many VLMs rely on proprietary data and are not open-source, which restricts the use of white-box approaches for fine-tuning. As such, we aim to develop a black-box approach to optimize VLMs through natural language prompts, thereby avoiding the need to access model parameters, feature embeddings, or even output logits. We propose employing chat-based LLMs to search for the best text prompt for VLMs. Specifically, we adopt an automatic hill-climbing procedure that converges to an effective prompt by evaluating the performance of current prompts and asking LLMs to refine them based on textual feedback, all within a conversational process with no human in the loop. In a challenging 1-shot image classification setup, our simple approach surpasses the white-box continuous prompting method (CoOp) by an average of 1.5% across 11 datasets including ImageNet. Our approach also outperforms both human-engineered and LLM-generated prompts. We highlight the advantage of conversational feedback that incorporates both positive and negative prompts, suggesting that LLMs can utilize the implicit gradient direction in textual feedback for a more efficient search. In addition, we find that the text prompts generated through our strategy are not only more interpretable but also transfer well across different VLM architectures in a black-box manner. Lastly, we apply our framework to optimize a state-of-the-art black-box VLM (DALL-E 3) for text-to-image generation, prompt inversion, and personalization.
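
The hill-climbing loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `hill_climb_prompts`, `toy_score`, `toy_propose`, `TARGET`, and `VOCAB` are hypothetical names introduced here. `toy_score` stands in for measuring a prompt's accuracy with the black-box VLM on the few-shot data, and `toy_propose` stands in for a chat-based LLM that is shown the best (positive) and worst (negative) prompts so far and asked for refinements.

```python
from typing import Callable, Dict, List, Tuple

Scored = Tuple[str, float]

def hill_climb_prompts(
    initial_prompts: List[str],
    score_fn: Callable[[str], float],
    propose_fn: Callable[[List[Scored], List[Scored]], List[str]],
    n_iters: int = 5,
    k: int = 1,
) -> Scored:
    """Score a pool of prompts, then repeatedly ask the proposer for new ones.

    Each round the proposer sees the top-k (positive) and bottom-k (negative)
    prompts with their scores, mirroring the textual feedback in the abstract.
    """
    scored: Dict[str, float] = {p: score_fn(p) for p in initial_prompts}
    for _ in range(n_iters):
        ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
        positives, negatives = ranked[:k], ranked[-k:]
        for candidate in propose_fn(positives, negatives):
            if candidate not in scored:  # only score unseen prompts
                scored[candidate] = score_fn(candidate)
    return max(scored.items(), key=lambda kv: kv[1])

# --- Toy stand-ins (assumptions, for illustration only) ---
TARGET = {"photo", "centered", "clear"}  # words the toy scorer rewards
VOCAB = ["photo", "blurry", "centered", "clear", "sketch"]

def words(prompt: str) -> set:
    return set(prompt.replace(",", " ").split())

def toy_score(prompt: str) -> float:
    """Placeholder for VLM accuracy: fraction of TARGET words in the prompt."""
    return len(words(prompt) & TARGET) / len(TARGET)

def toy_propose(positives: List[Scored], negatives: List[Scored]) -> List[str]:
    """Placeholder for the LLM: extend the best prompt, avoiding words that
    appear only in the negative prompts."""
    best = positives[0][0]
    pos_words = set().union(*(words(p) for p, _ in positives))
    neg_words = set().union(*(words(p) for p, _ in negatives))
    avoid = neg_words - pos_words
    return [
        f"{best}, {w}"
        for w in VOCAB
        if w not in avoid and w not in words(best)
    ]

best_prompt, best_score = hill_climb_prompts(
    ["an image of a {}", "a blurry {}"], toy_score, toy_propose
)
print(best_prompt, best_score)
```

In a real setting, `score_fn` would query the VLM and `propose_fn` would send the positive and negative prompts to a chat LLM; the loop itself needs no access to model parameters, embeddings, or logits, only to the scalar score of each prompt.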


Black box

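In science, computing, and engineering, a black box is a device, system, or object that can be viewed in terms of its inputs and outputs (or transfer characteristics) without any knowledge of its internal workings. Its implementation is "opaque" (black). Almost anything can be referred to as a black box: a transistor, an engine, an algorithm, the human brain, an institution, or a government. When something modeled as an open system is analyzed with a typical "black-box approach", only its stimulus/response behavior is considered in order to infer the (unknown) contents of the box. The usual representation of such a black-box system is a data flow diagram centered on the box. The opposite of a black box is a system whose inner components or logic are available for inspection, usually called a white box (sometimes also a "clear box" or "glass box").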
