Large Vision-Language Models (LVLMs) have advanced considerably, combining visual recognition and language understanding to generate content that is both coherent and contextually appropriate. Despite this progress, LVLMs still suffer from object hallucination: they generate plausible yet incorrect outputs that mention objects not present in the image. To mitigate this issue, we introduce Visual Contrastive Decoding (VCD), a simple, training-free method that contrasts the output distributions derived from original and distorted visual inputs. VCD reduces over-reliance on statistical bias and unimodal priors, two essential causes of object hallucination. This adjustment keeps the generated content closely grounded in the visual input, yielding contextually accurate outputs. Our experiments show that VCD, without additional training or external tools, significantly mitigates object hallucination across different LVLM families. Beyond mitigating object hallucination, VCD also performs strongly on general LVLM benchmarks, highlighting its wide applicability.
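The contrast between the two output distributions can be illustrated with a minimal sketch. The abstract does not state the exact formula, so the scoring rule below, (1 + alpha) * log p(original) - alpha * log p(distorted) with a plausibility cutoff, is an assumption borrowed from the usual contrastive-decoding form. The names `contrast_logits`, `greedy_pick`, `alpha`, and `beta`, the three-word vocabulary, and the logit values are all hypothetical and chosen only for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrast_logits(logits_original, logits_distorted, alpha=1.0, beta=0.1):
    """Score next-token candidates by contrasting two distributions.

    Assumed form (not given in the abstract):
        score = (1 + alpha) * log p_original - alpha * log p_distorted
    Tokens whose original probability falls below beta * max(p_original)
    are masked out, so the contrast cannot promote implausible tokens.
    """
    logp_orig = log_softmax(logits_original)
    logp_dist = log_softmax(logits_distorted)
    cutoff = math.log(beta) + max(logp_orig)
    scores = []
    for lo, ld in zip(logp_orig, logp_dist):
        if lo >= cutoff:
            scores.append((1 + alpha) * lo - alpha * ld)
        else:
            scores.append(float("-inf"))
    return scores


def greedy_pick(scores):
    """Index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical next-token logits over a toy vocabulary.
vocab = ["dog", "frisbee", "grass"]
logits_original = [2.0, 2.2, 0.5]   # conditioned on the real image
logits_distorted = [0.5, 2.5, 0.4]  # conditioned on a distorted image

plain_choice = vocab[greedy_pick(logits_original)]
contrasted_choice = vocab[greedy_pick(contrast_logits(logits_original, logits_distorted))]
print(plain_choice, contrasted_choice)
```

In this toy case "frisbee" stays likely even when the image is distorted, which suggests it is driven by a prior rather than by visual evidence, so the contrast penalizes it and the choice shifts to "dog". In this sketch, a larger `alpha` strengthens the penalty and a larger `beta` restricts the candidate set more aggressively.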