Large Vision-Language Models (LVLMs) can reason effectively over image-text inputs and perform well across a variety of multimodal tasks. Despite this success, they are susceptible to language priors and often produce hallucinations: generated content that is grammatically and syntactically coherent yet lacks grounding in, or direct relevance to, the actual visual input. To address this problem, we propose Residual Decoding (ResDec), a novel training-free method that leverages historical information to aid decoding. ResDec exploits the internal implicit reasoning mechanism and the token-logits evolution mechanism of LVLMs to correct such biases. Extensive experiments demonstrate that ResDec effectively suppresses hallucinations induced by language priors, significantly improves visual grounding, and reduces object hallucinations. Beyond mitigating hallucinations, ResDec also performs strongly on comprehensive LVLM benchmarks, highlighting its broad applicability.