Hallucinations in multimodal large language models (MLLMs) hinder their practical applications. To address this, we propose the Magnifier Prompt (MagPrompt), a simple yet effective method that tackles hallucinations in MLLMs with extremely simple instructions. MagPrompt is built on two key principles, which guide the design of a variety of effective prompts and lend the method robustness: (1) MLLMs should focus more on the image. (2) When the image conflicts with the model's internal knowledge, MLLMs should prioritize the image. MagPrompt is training-free and can be applied to both open-source and closed-source models, such as GPT-4o and Gemini-pro. It performs well across many datasets, and its effectiveness is comparable to, or even better than, that of more complex methods like VCD. Furthermore, our prompt design principles and experimental analyses provide valuable insights into multimodal hallucination.
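To make the idea concrete, the sketch below shows how a MagPrompt-style instruction could be appended to a visual question when querying GPT-4o through the OpenAI chat completions API. The instruction wording and the helper `ask_with_magprompt` are illustrative assumptions based on the two principles above, not the paper's verbatim prompt.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative instruction reflecting the two principles:
# focus on the image, and prioritize it over internal knowledge.
# The exact wording is an assumption, not the paper's prompt.
MAG_PROMPT = (
    "Pay close attention to the image. If the image conflicts with your "
    "internal knowledge, trust the image."
)

def ask_with_magprompt(question: str, image_url: str, model: str = "gpt-4o") -> str:
    """Send a visual question with a MagPrompt-style instruction appended (sketch)."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": f"{question}\n{MAG_PROMPT}"},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    )
    return response.choices[0].message.content

# Hypothetical usage:
# print(ask_with_magprompt("Is there a dog in this picture?", "https://example.com/photo.jpg"))
```

Because the method only modifies the textual prompt, the same pattern carries over unchanged to other open-source or closed-source MLLM APIs.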