Despite the significant success of Large Vision-Language Models (LVLMs), these models still suffer from hallucinations when describing images, generating answers that include non-existent objects. Prior work reports that these models tend to over-focus on certain irrelevant image tokens that carry no critical information for answering the question, distorting the output. To address this, we propose an Instruction-Aligned Visual Attention (IAVA) approach, which identifies irrelevant tokens by comparing changes in attention weights under two different instructions. By applying contrastive decoding, we dynamically adjust the logits generated from the original image tokens and the irrelevant image tokens, reducing the model's over-attention to irrelevant information. Experimental results demonstrate that IAVA consistently outperforms existing decoding techniques in mitigating object hallucinations on benchmarks such as MME, POPE, and TextVQA. Our IAVA approach is available online at https://github.com/Lee-lab558/IAVA.
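The contrastive-decoding step described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the combination rule `(1 + alpha) * logits_full - alpha * logits_irrelevant` and the name `alpha` are assumptions following the standard contrastive-decoding form; the paper's actual weighting may differ.

```python
import numpy as np

def contrastive_decode(logits_full, logits_irrelevant, alpha=1.0):
    """Hedged sketch of contrastive decoding: amplify logits computed with
    the full set of image tokens and subtract logits computed from only the
    irrelevant image tokens, suppressing next-token candidates that are
    supported mainly by irrelevant visual evidence.

    alpha is a hypothetical strength parameter (not from the paper).
    """
    return (1.0 + alpha) * logits_full - alpha * logits_irrelevant

# Toy example over a 4-token vocabulary.
logits_full = np.array([2.0, 1.0, 0.5, 0.1])   # logits from all image tokens
logits_irr  = np.array([0.5, 1.5, 0.2, 0.1])   # logits from irrelevant tokens only
adjusted = contrastive_decode(logits_full, logits_irr, alpha=1.0)
# Token 1, which is disproportionately supported by irrelevant tokens,
# is pushed below token 2 after the adjustment.
```

In this sketch, a candidate token whose score relies heavily on the irrelevant image tokens is down-weighted, which mirrors the abstract's goal of reducing over-attention to uninformative visual regions.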