Large Vision-Language Models (LVLMs) are susceptible to object hallucination, where the generated text mentions objects that are not present in the image, greatly limiting their reliability and practicality. Current approaches often rely on the model's token likelihoods or other internal information, on instruction tuning with additional datasets, or on complex external tools. We first perform an empirical analysis of sentence-level LVLM hallucination and find that CLIP similarity to the image is a stronger and more robust indicator of hallucination than token likelihoods. Motivated by this, we introduce CLIP-Guided Decoding (CGD), a simple yet effective training-free approach that reduces object hallucination at decoding time. CGD uses CLIP to guide the model's decoding process, strengthening the visual grounding of the generated text in the image. Experiments demonstrate that CGD effectively mitigates object hallucination across multiple LVLM families while preserving the utility of the generated text. Code is available at https://github.com/d-ailin/CLIP-Guided-Decoding.
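The sentence-level guidance described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate_candidates` stands in for the LVLM's sampler and `clip_similarity` for a real CLIP image-text score (e.g. from a pretrained CLIP model); both names, and the beam/sentence parameters, are hypothetical.

```python
def cgd_select(candidates, image, clip_similarity, top_k=1):
    """Rank candidate sentences by CLIP similarity to the image and
    keep the top_k best-grounded ones (higher score = better grounding)."""
    return sorted(candidates, key=lambda s: clip_similarity(image, s),
                  reverse=True)[:top_k]

def cgd_decode(image, generate_candidates, clip_similarity,
               max_sentences=3, beams=4):
    """Sentence-level guided decoding sketch: at each step, sample several
    candidate next sentences from the LVLM and append the one that CLIP
    grounds best in the image."""
    text = ""
    for _ in range(max_sentences):
        candidates = generate_candidates(image, text, n=beams)
        if not candidates:  # sampler signals end of generation
            break
        best = cgd_select(candidates, image, clip_similarity, top_k=1)[0]
        text += best
    return text
```

In practice the candidate sentences would come from sampling the LVLM's decoder, and the score from encoding the image and each candidate with CLIP and taking their cosine similarity; the selection logic itself is model-agnostic.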