A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models

Prompt engineering is a technique that augments a large pre-trained model with task-specific hints, known as prompts, to adapt the model to new tasks. Prompts can be created manually as natural language instructions or generated automatically as either natural language instructions or vector representations. Prompt engineering makes it possible to perform predictions based solely on prompts, without updating model parameters, and makes large pre-trained models easier to apply to real-world tasks. In past years, prompt engineering has been well studied in natural language processing. Recently, it has also been intensively studied in vision-language modeling. However, a systematic overview of prompt engineering on pre-trained vision-language models is currently lacking. This paper aims to provide a comprehensive survey of cutting-edge research in prompt engineering on three types of vision-language models: multimodal-to-text generation models (e.g., Flamingo), image-text matching models (e.g., CLIP), and text-to-image generation models (e.g., Stable Diffusion). For each type of model, we summarize and discuss a brief model summary, prompting methods, prompting-based applications, and the corresponding responsibility and integrity issues. We also discuss the commonalities and differences between prompting on vision-language models, language models, and vision models. Finally, the challenges, future directions, and research opportunities are summarized to foster future research on this topic.

