Large Vision-Language Models (LVLMs) demonstrate impressive capabilities in generating detailed and coherent responses from visual inputs. However, they are prone to generating hallucinations due to an over-reliance on language priors. To address this issue, we investigate the language priors in LVLMs and make two key observations: (1) Even when predicting tokens associated with image-related parts of speech (POS), models rely increasingly on linguistic priors as the token sequence grows, thereby amplifying hallucinations. (2) Methods that directly calibrate an LVLM's output distribution to mitigate language priors can degrade text quality or even exacerbate hallucinations. Based on these findings, we propose a novel method, Summary-Guided Decoding (SGD). This method naturally encourages the model to focus more on image information by reducing the text context through summaries, while controlling only the image-related POS tokens to maintain text quality. Through experiments, we demonstrate that SGD achieves state-of-the-art performance on object hallucination benchmarks. Furthermore, in terms of the trade-off between precision and recall, SGD achieves Pareto optimality among the existing methods. Lastly, we observe that although existing methods struggle to balance reducing object hallucinations with maintaining text quality, SGD handles this trade-off robustly.
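The decoding rule described above can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: the set of image-related POS tags, the dictionary-based distributions, and the `sgd_step` function are all hypothetical stand-ins for the real model's logits and tagger.

```python
# Hypothetical sketch of Summary-Guided Decoding (SGD) as described in the
# abstract. The POS set, the dict-based probability interface, and greedy
# selection are all assumptions made for illustration.

IMAGE_RELATED_POS = {"NOUN", "ADJ", "NUM"}  # assumed image-related POS tags


def sgd_step(full_ctx_probs, summary_ctx_probs, token_pos):
    """Pick the next token by mixing two next-token distributions.

    full_ctx_probs:    candidate token -> probability under the full
                       generated context (prone to language priors).
    summary_ctx_probs: candidate token -> probability under a summarized
                       (shortened) context, which stays closer to the image.
    token_pos:         candidate token -> POS tag.
    """
    mixed = {}
    for tok, p_full in full_ctx_probs.items():
        if token_pos.get(tok) in IMAGE_RELATED_POS:
            # Image-related POS tokens follow the summary-conditioned
            # distribution, weakening accumulated language priors.
            mixed[tok] = summary_ctx_probs.get(tok, p_full)
        else:
            # All other tokens keep the original distribution,
            # preserving fluency and text quality.
            mixed[tok] = p_full
    return max(mixed, key=mixed.get)
```

In this toy setting, a noun favored only by the language prior ("sofa") loses to a noun supported by the summary-conditioned distribution ("chair"), while function words are left untouched.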