Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information through human language. However, LVLMs still suffer from object hallucination: generating descriptions that include objects that do not actually exist in the image. This can negatively impact many vision-language tasks, such as visual summarization and reasoning. To address this issue, we propose a simple yet powerful algorithm, LVLM Hallucination Revisor (LURE), which rectifies object hallucination in LVLMs post hoc by reconstructing less hallucinatory descriptions. LURE is grounded in a rigorous statistical analysis of the key factors underlying object hallucination, including co-occurrence (the frequent appearance of certain objects alongside others in images), uncertainty (objects decoded by the LVLM with higher uncertainty), and object position (hallucination often appears in the later part of the generated text). LURE can also be seamlessly integrated with any LVLM. We evaluate LURE on six open-source LVLMs, achieving a 23% improvement in general object hallucination evaluation metrics over the previous best approach. In both GPT and human evaluations, LURE consistently ranks at the top. Our data and code are available at https://github.com/YiyangZhou/LURE.
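To make the three factors concrete, the following is a minimal Python sketch of how co-occurrence, uncertainty, and position could be combined to flag object mentions that deserve a second look. It is an illustration under stated assumptions, not the LURE implementation: the `ObjectMention` type, the `flag_suspects` function, the co-occurrence table, all thresholds, and the "at least two signals" rule are hypothetical and do not come from the abstract.

```python
import math
from dataclasses import dataclass


@dataclass
class ObjectMention:
    name: str          # object word as it appears in the description
    token_prob: float  # probability the LVLM assigned to the object token (must be > 0)
    token_index: int   # index of the object token within the description


def cooccurrence_score(name, others, cooc_counts):
    """Mean co-occurrence count between `name` and the other mentioned objects."""
    if not others:
        return 0.0
    total = sum(cooc_counts.get(frozenset((name, o)), 0) for o in others)
    return total / len(others)


def flag_suspects(mentions, description_length, cooc_counts,
                  uncertainty_threshold=1.0,   # hypothetical value
                  position_threshold=0.6,      # hypothetical value
                  cooc_threshold=50.0,         # hypothetical value
                  min_signals=2):              # hypothetical decision rule
    """Return names of objects for which at least `min_signals` factors fire."""
    suspects = []
    for m in mentions:
        # Uncertainty: negative log-probability of the object token.
        uncertainty = -math.log(m.token_prob)
        # Position: how far into the description the object appears (0 to 1).
        relative_position = m.token_index / description_length
        # Co-occurrence: how strongly the object is tied to the others mentioned.
        others = [o.name for o in mentions if o.name != m.name]
        cooc = cooccurrence_score(m.name, others, cooc_counts)

        signals = [
            uncertainty > uncertainty_threshold,
            relative_position > position_threshold,
            cooc > cooc_threshold,
        ]
        if sum(signals) >= min_signals:
            suspects.append(m.name)
    return suspects


# Toy inputs, invented for illustration only.
cooc_counts = {
    frozenset(("dog", "frisbee")): 120,
    frozenset(("dog", "grass")): 80,
    frozenset(("frisbee", "grass")): 30,
}
mentions = [
    ObjectMention("dog", token_prob=0.9, token_index=3),
    ObjectMention("grass", token_prob=0.8, token_index=10),
    ObjectMention("frisbee", token_prob=0.2, token_index=40),
]
suspects = flag_suspects(mentions, description_length=50, cooc_counts=cooc_counts)
print(suspects)  # ['frisbee']
```

In this toy example only "frisbee" is flagged: it was decoded with low probability, appears late in the description, and co-occurs frequently with the other mentioned objects. A revisor in the spirit of LURE would then rewrite the description with such objects reconsidered; how LURE actually does this is described in the paper and repository, not here.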