Inspired by the superior language abilities of large language models (LLMs), large vision-language models (LVLMs) have recently been explored, integrating powerful LLMs to improve performance on complex multimodal tasks. Despite the promising progress on LVLMs, we find that they suffer from the hallucination problem, i.e., they tend to generate descriptions containing objects that are inconsistent with the target images. To investigate this, we present the first systematic study of object hallucination in LVLMs. We conduct evaluation experiments on several representative LVLMs and show that most of them suffer from severe object hallucination. We further discuss how visual instructions may influence hallucination, and find that objects that frequently occur in the visual instructions, or that co-occur with the image objects, are clearly prone to being hallucinated by LVLMs. In addition, we find that existing evaluation methods might be affected by the input instructions and the generation styles of LVLMs. We therefore design an improved evaluation method for object hallucination, a polling-based query method called POPE. Experimental results demonstrate that POPE can evaluate object hallucination in a more stable and flexible way. Our code and data are publicly available at https://github.com/RUCAIBox/POPE.
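The abstract names a polling-based query method but does not spell out its mechanics. The following is a minimal sketch, assuming that polling means asking the model a yes/no question about each candidate object and scoring the answers against ground truth. The prompt template, the function names (`build_polling_questions`, `parse_answer`, `score_polling`), and the choice of reported metrics are illustrative assumptions, not the released POPE implementation.

```python
from typing import Dict, List, Tuple


def build_polling_questions(present: List[str], absent: List[str]) -> List[Tuple[str, str]]:
    """Pair each candidate object with a yes/no question and its expected answer.

    `present` holds objects annotated in the image, `absent` holds objects that
    are not in it. The question template is an assumption for illustration.
    """
    template = "Is there a {} in the image?"
    questions = [(template.format(obj), "yes") for obj in present]
    questions += [(template.format(obj), "no") for obj in absent]
    return questions


def parse_answer(reply: str) -> str:
    """Map a free-form model reply to "yes" or "no" (anything without "yes" counts as "no")."""
    words = reply.lower().replace(",", " ").replace(".", " ").split()
    return "yes" if "yes" in words else "no"


def score_polling(predictions: List[str], labels: List[str]) -> Dict[str, float]:
    """Compute binary classification metrics, treating "yes" as the positive class."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    pairs = list(zip(predictions, labels))
    tp = sum(p == "yes" and l == "yes" for p, l in pairs)
    fp = sum(p == "yes" and l == "no" for p, l in pairs)
    fn = sum(p == "no" and l == "yes" for p, l in pairs)
    tn = sum(p == "no" and l == "no" for p, l in pairs)
    total = len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Share of "yes" answers; a high value suggests a bias toward affirming objects.
        "yes_ratio": (tp + fp) / total if total else 0.0,
    }


if __name__ == "__main__":
    questions = build_polling_questions(["dog", "person"], ["car", "bench"])
    # Hypothetical model replies, one per question, in the same order.
    replies = ["Yes, there is a dog.", "No.", "Yes, I can see a car.", "No, there is no bench."]
    predictions = [parse_answer(r) for r in replies]
    labels = [label for _, label in questions]
    print(score_polling(predictions, labels))
```

Because every query reduces to a binary answer, a score of this kind does not depend on the length or style of a generated caption, which is one plausible reading of why the abstract describes the evaluation as more stable.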