Large language models (LLMs) have exhibited impressive abilities in multimodal content comprehension and reasoning when properly prompted in zero- or few-shot settings. Despite the proliferation of interactive systems developed to support prompt engineering for LLMs across various tasks, most focus primarily on textual or visual inputs, neglecting the complex interplay between modalities within multimodal inputs. This oversight hinders the development of effective prompts that guide models' multimodal reasoning by fully exploiting the rich context provided by multiple modalities. In this paper, we present POEM, a visual analytics system that facilitates efficient prompt engineering for enhancing the multimodal reasoning performance of LLMs. The system enables users to explore interaction patterns across modalities at varying levels of detail, providing a comprehensive understanding of the multimodal knowledge elicited by different prompts. Through diverse recommendations of demonstration examples and instructional principles, POEM supports users in iteratively crafting and refining prompts to better align model knowledge with human insights and enhance it accordingly. The effectiveness and efficiency of our system are validated through two case studies and interviews with experts.