The recent development of Large Vision-Language Models (LVLMs) has attracted growing attention within the AI landscape for its practical implementation potential. However, ``hallucination'', or more specifically, the misalignment between factual visual content and the corresponding textual generation, poses a significant challenge to utilizing LVLMs. In this comprehensive survey, we dissect LVLM-related hallucinations in an attempt to establish an overview and facilitate future mitigation. We start by clarifying the concept of hallucination in LVLMs, presenting a variety of hallucination symptoms and highlighting the challenges unique to LVLM hallucinations. Subsequently, we outline the benchmarks and methodologies tailored specifically for evaluating hallucinations in LVLMs. Additionally, we investigate the root causes of these hallucinations, encompassing insights from the training data and model components. We also critically review existing methods for mitigating hallucinations. We conclude the survey by discussing open questions and future directions pertaining to hallucinations in LVLMs.