Recent interest in Large Vision-Language Models (LVLMs) for practical applications is tempered by the significant challenge of hallucination, i.e., inconsistency between factual information and the generated text. In this paper, we first perform an in-depth analysis of hallucinations and report several novel insights about how and when LVLMs hallucinate. Our analysis shows that: (1) The community's efforts have primarily targeted hallucinations on visual recognition (VR) prompts (e.g., prompts that only require describing the image), thereby neglecting hallucinations on cognitive prompts (e.g., prompts that require additional skills such as reasoning about the contents of the image). (2) LVLMs lack visual perception, i.e., they can see the input image but do not necessarily understand or perceive it. We analyze responses to cognitive prompts and show that LVLMs hallucinate due to a perception gap: although LVLMs accurately recognize the visual elements in the input image and possess sufficient cognitive skills, they still struggle to respond accurately and hallucinate. To overcome this shortcoming, we propose Visual Description Grounded Decoding (VDGD), a simple, robust, and training-free method for alleviating hallucinations. Specifically, we first generate a description of the image and prepend that description to the instruction. Next, during auto-regressive decoding, we sample from the plausible candidate tokens according to their KL-Divergence (KLD) to the description, with lower KLD receiving higher preference. Experimental results on several benchmarks and LVLMs show that VDGD significantly outperforms other baselines in reducing hallucinations. We also propose VaLLu, a benchmark for the comprehensive evaluation of the cognitive capabilities of LVLMs.
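The decoding step described above can be illustrated with a minimal sketch. This is one possible reading of the abstract, not the authors' implementation. Several details are assumptions made only for illustration: the plausibility cutoff `alpha` (candidates must have at least `alpha` times the top probability), the use of a one-hot distribution for each candidate, taking the minimum KLD over description positions, and the hypothetical input `description_probs` (the model's predicted distributions at the positions of the description prefix). The function names are likewise hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def plausible_candidates(next_probs, alpha=0.1):
    """Token ids with probability >= alpha * top probability (assumed cutoff rule)."""
    threshold = alpha * max(next_probs)
    return [i for i, p in enumerate(next_probs) if p >= threshold]


def kld_to_description(token_id, description_probs, eps=1e-12):
    """Assumed distance of a candidate to the description.

    For a one-hot distribution on `token_id`, KL(one_hot || p) = -log p[token_id].
    We take the minimum over all description positions.
    """
    return min(-math.log(max(p[token_id], eps)) for p in description_probs)


def vdgd_step(next_logits, description_probs, alpha=0.1, rng=random):
    """Sample one token; candidates with lower KLD get higher sampling weight."""
    next_probs = softmax(next_logits)
    candidates = plausible_candidates(next_probs, alpha)
    klds = [kld_to_description(c, description_probs) for c in candidates]
    # Lower KLD -> higher preference: softmax over negated divergences.
    weights = softmax([-k for k in klds])
    token = rng.choices(candidates, weights=weights, k=1)[0]
    return token, dict(zip(candidates, weights))


# Toy example with a 4-token vocabulary and a 2-token description prefix.
next_logits = [2.0, 2.0, 0.5, -5.0]
description_probs = [
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.05, 0.3, 0.05],
]
token, weights = vdgd_step(next_logits, description_probs, rng=random.Random(0))
print(token, weights)
```

In this toy example, tokens 0 and 1 are equally likely under the raw logits, but token 0 is better supported by the description and therefore receives the larger sampling weight. Token 3 is excluded by the plausibility cutoff before any reweighting takes place.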