Large multimodal models (LMMs) suffer from multimodal hallucination, in which they give incorrect responses that are misaligned with the given visual information. Recent work has conjectured that one cause of multimodal hallucination is the vision encoder failing to ground properly on the image. To mitigate this issue, we propose a novel approach that leverages self-feedback as visual cues. Building on this approach, we introduce Volcano, a multimodal self-feedback guided revision model. Volcano generates natural language feedback on its initial response based on the provided visual information and uses this feedback to self-revise that response. Volcano effectively reduces multimodal hallucination and achieves state-of-the-art results on MMHal-Bench, POPE, and GAVIE. It also improves general multimodal abilities and outperforms previous models on MM-Vet and MMBench. Through a qualitative analysis, we show that Volcano's feedback is more properly grounded on the image than the initial response. This indicates that Volcano can provide itself with richer visual information, helping alleviate multimodal hallucination. We publicly release Volcano models of 7B and 13B sizes along with the data and code at https://github.com/kaistAI/Volcano.
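The feedback-then-revise procedure described in the abstract can be illustrated with a minimal sketch. This is not the released Volcano implementation: the `generate` callable, the prompt templates, and the function name `self_feedback_revise` are all assumptions introduced here for illustration, and the sketch performs a single feedback and revision pass using one model for all three steps.

```python
from typing import Callable, Dict

# Hypothetical model interface (an assumption, not Volcano's API):
# takes an image and a text prompt, returns generated text.
Generate = Callable[[object, str], str]

# Illustrative prompt templates; the wording is assumed, not taken from the paper.
FEEDBACK_TEMPLATE = (
    "Give feedback on the response below, based on the image.\n"
    "Question: {question}\nResponse: {response}"
)
REVISE_TEMPLATE = (
    "Revise the response below using the feedback.\n"
    "Question: {question}\nResponse: {response}\nFeedback: {feedback}"
)


def self_feedback_revise(generate: Generate, image: object, question: str) -> Dict[str, str]:
    """Run one initial-response, feedback, revision pass with a single model."""
    # Step 1: answer the question from the image.
    initial = generate(image, question)

    # Step 2: the same model critiques its own answer while looking at the image.
    feedback = generate(
        image, FEEDBACK_TEMPLATE.format(question=question, response=initial)
    )

    # Step 3: the model rewrites its answer, conditioned on its own feedback.
    revised = generate(
        image,
        REVISE_TEMPLATE.format(question=question, response=initial, feedback=feedback),
    )
    return {"initial": initial, "feedback": feedback, "revised": revised}
```

Passing the image at every step matters in this sketch: the feedback is meant to carry visual information back into the revision, so both the feedback and the revision prompts are conditioned on the image as well as on the earlier text.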