Large multimodal models (LMMs) suffer from multimodal hallucination, where they provide incorrect responses misaligned with the given visual information. Recent works have conjectured that one cause of multimodal hallucination is the vision encoder failing to ground on the image properly. To mitigate this issue, we propose a novel approach that leverages self-feedback as visual cues. Building on this approach, we introduce Volcano, a multimodal self-feedback guided revision model. Volcano generates natural language feedback on its initial response based on the provided visual information and uses this feedback to self-revise that response. Volcano effectively reduces multimodal hallucination and achieves state-of-the-art results on MMHal-Bench, POPE, and GAVIE. It also improves general multimodal abilities and outperforms previous models on MM-Vet and MMBench. Through a qualitative analysis, we show that Volcano's feedback is better grounded on the image than its initial response. This indicates that Volcano can provide itself with richer visual information, helping alleviate multimodal hallucination. We publicly release Volcano models of 7B and 13B sizes along with the data and code at https://github.com/kaistAI/Volcano.
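
The feedback-then-revise flow described in the abstract can be illustrated with a minimal sketch. It is not the released Volcano implementation: the `generate` callable, the prompt wording, and the `RevisionResult` container are hypothetical names introduced here for illustration, and the sketch covers only the steps the abstract states (initial response, feedback, revision).

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical interface (an assumption, not Volcano's API):
# takes an image and a text prompt, returns generated text.
Generate = Callable[[Any, str], str]


@dataclass
class RevisionResult:
    initial: str   # first answer produced from the image and question
    feedback: str  # natural language critique of the initial answer
    revised: str   # answer rewritten using the feedback


def self_feedback_revise(generate: Generate, image: Any, question: str) -> RevisionResult:
    """Run one pass of respond -> critique -> revise with the same model."""
    # Step 1: answer the question from the image.
    initial = generate(image, f"Question: {question}\nAnswer:")

    # Step 2: critique the initial answer, again conditioned on the image,
    # so the feedback can carry visual details the answer missed.
    feedback = generate(
        image,
        f"Question: {question}\nInitial answer: {initial}\n"
        "Give feedback on the initial answer based on the image:",
    )

    # Step 3: rewrite the answer using the feedback as an extra cue.
    revised = generate(
        image,
        f"Question: {question}\nInitial answer: {initial}\n"
        f"Feedback: {feedback}\nRevised answer:",
    )
    return RevisionResult(initial=initial, feedback=feedback, revised=revised)
```

Any model wrapper matching the assumed `generate(image, prompt)` signature can be passed in. The same model produces all three outputs, which is what makes the feedback "self" feedback.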