Generative Vision-Language Models (VLMs) are prone to generating plausible-sounding textual answers that are not always grounded in the input image. We investigate this phenomenon, usually referred to as "hallucination", and show that it stems from excessive reliance on the language prior. In particular, we show that as more tokens are generated, reliance on the visual prompt decreases, and that this behavior strongly correlates with the emergence of hallucinations. To reduce hallucinations, we introduce Multi-Modal Mutual-Information Decoding (M3ID), a new sampling method for prompt amplification. M3ID amplifies the influence of the reference image over the language prior, favoring the generation of tokens with higher mutual information with the visual prompt. M3ID can be applied to any pre-trained autoregressive VLM at inference time, requires no further training, and adds minimal computational overhead. When training is an option, we show that M3ID can be paired with Direct Preference Optimization (DPO) to improve the model's reliance on the prompt image without requiring any labels. Our empirical findings show that these algorithms maintain the fluency and linguistic capabilities of pre-trained VLMs while reducing hallucinations by mitigating visually ungrounded answers. Specifically, for the LLaVA 13B model, M3ID and M3ID+DPO reduce the percentage of hallucinated objects in captioning tasks by 25% and 28%, respectively, and improve accuracy on VQA benchmarks such as POPE by 21% and 24%.
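The core idea of favoring tokens with higher mutual information with the visual prompt can be sketched as contrasting the image-conditioned token distribution against the language-prior-only distribution, in the spirit of contrastive decoding. The function names, the toy distributions, and the `gamma` weighting below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def m3id_scores(logp_cond, logp_uncond, gamma=0.5):
    """PMI-style token scoring (sketch, not the paper's exact rule).

    logp_cond:   log p(token | image, text prefix)
    logp_uncond: log p(token | text prefix), i.e. the language prior alone
    gamma:       hypothetical knob controlling how strongly the prior
                 is penalized

    Tokens whose probability rises when the image is present (high
    pointwise mutual information with the visual prompt) score higher;
    tokens favored only by the language prior are demoted.
    """
    return logp_cond - gamma * logp_uncond

# Toy vocabulary: the image shows a cat, but the language prior
# (e.g. a beach-scene caption prefix) favors "surfboard".
vocab = ["cat", "dog", "surfboard"]
logp_cond = np.log(np.array([0.6, 0.3, 0.1]))    # model sees the image
logp_uncond = np.log(np.array([0.2, 0.2, 0.6]))  # prior alone

scores = m3id_scores(logp_cond, logp_uncond)
best = vocab[int(np.argmax(scores))]
print(best)  # "cat": the visually grounded token wins
```

With `gamma = 0`, this reduces to standard maximum-likelihood decoding from the conditioned model; larger values push generation further toward visually grounded tokens at the risk of hurting fluency, which is why the full method applies the correction adaptively rather than uniformly.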