Large Language Models (LLMs) have shown remarkable success, and their multimodal extensions (MLLMs) further unlock capabilities spanning images, videos, and other modalities beyond text. However, despite this shift, prompt optimization approaches, designed to reduce the burden of manual prompt crafting while maximizing performance, remain confined to text, ultimately limiting the full potential of MLLMs. Motivated by this gap, we introduce the new problem of multimodal prompt optimization, which expands the prior definition of prompt optimization to the multimodal space defined by pairs of textual and non-textual prompts. To tackle this problem, we then propose the Multimodal Prompt Optimizer (MPO), a unified framework that not only performs joint optimization of multimodal prompts through alignment-preserving updates but also guides the selection of candidate prompts by leveraging earlier evaluations as priors in a Bayesian selection strategy. Through extensive experiments across diverse modalities beyond text, such as images, videos, and even molecules, we demonstrate that MPO outperforms leading text-only optimization methods, establishing multimodal prompt optimization as a crucial step toward realizing the potential of MLLMs.
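To make the Bayesian candidate-selection idea concrete, the sketch below illustrates one standard way to use earlier evaluation outcomes as priors when choosing which candidate prompt to evaluate next: Thompson sampling with Beta priors. This is an illustrative stand-in, not the actual MPO selection procedure; the function name `thompson_select` and the (successes, failures) history format are assumptions for the example.

```python
import random

def thompson_select(candidates, history):
    """Pick one candidate prompt via Thompson sampling (illustrative sketch).

    history maps candidate id -> (successes, failures) from earlier
    evaluations; these counts act as the Beta prior for that candidate,
    so previously well-scoring prompts are sampled more often while
    under-explored ones still get a chance.
    """
    best, best_draw = None, -1.0
    for cid in candidates:
        s, f = history.get(cid, (0, 0))
        # Sample from the Beta(s+1, f+1) posterior over this prompt's success rate.
        draw = random.betavariate(s + 1, f + 1)
        if draw > best_draw:
            best, best_draw = cid, draw
    return best
```

In this toy setup, a candidate with a strong evaluation record is selected with high probability, while untried candidates fall back to a uniform Beta(1, 1) prior, balancing exploitation of known-good prompts against exploration of new ones.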