Despite their great success across various multimodal tasks, Large Vision-Language Models (LVLMs) often suffer from object hallucinations, where generated textual responses are inconsistent with the actual objects in images. We examine different LVLMs and pinpoint one root cause of object hallucinations: deficient attention to discriminative image features. Specifically, LVLMs often predominantly attend to prompt-irrelevant global features instead of prompt-relevant local features, undermining their visual grounding capacity and leading to object hallucinations. We propose Assembly of Global and Local Attention (AGLA), a training-free and plug-and-play approach that mitigates hallucinations by simultaneously assembling global features for response generation and local features for visual discrimination. In particular, we introduce an image-prompt matching scheme that captures prompt-relevant local features from images, yielding an augmented view of the input image in which prompt-relevant content is highlighted while irrelevant distractions are suppressed. Hallucinations can thus be mitigated with a calibrated logit distribution derived from the generative global features of the original image and the discriminative local features of the augmented image. Extensive experiments demonstrate the superiority of AGLA in mitigating LVLM hallucinations and its wide applicability across both discriminative and generative tasks. Our code is available at https://github.com/Lackel/AGLA.
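To make the assembly concrete, the sketch below illustrates one plausible instantiation of the two steps the abstract describes: masking the image down to its prompt-relevant regions, then combining next-token logits from the original and augmented views. Everything here is a hedged assumption rather than the paper's actual implementation: the names `prompt_relevant_mask` and `assemble_logits`, the 16x16 patch grid, masking by zeroing, the per-patch relevance map, the mixing weight `alpha`, and the `model(image, prompt_ids)` interface are all hypothetical; the paper's exact matching scheme and calibration formula may differ.

```python
# Minimal sketch of AGLA-style logit assembly (hypothetical; the paper's
# exact matching scheme and calibration formula may differ).
import torch
import torch.nn.functional as F

def prompt_relevant_mask(image: torch.Tensor,
                         relevance: torch.Tensor,
                         keep_ratio: float = 0.5) -> torch.Tensor:
    """Build an augmented view that keeps only prompt-relevant patches.

    `relevance` is a per-patch image-prompt matching score of shape
    (batch, grid_h, grid_w), e.g. a CLIP-style similarity map; patches below
    the keep threshold are zeroed out. The 16x16 patch size and
    masking-by-zeroing are illustrative assumptions.
    """
    b, c, h, w = image.shape
    # Keep the top `keep_ratio` fraction of patches per image.
    thresh = torch.quantile(relevance.flatten(1), 1 - keep_ratio, dim=1)
    keep = (relevance >= thresh[:, None, None]).float()      # (b, gh, gw)
    # Upsample the patch mask to pixel resolution and suppress the rest.
    keep = F.interpolate(keep[:, None], size=(h, w), mode="nearest")
    return image * keep

def assemble_logits(model, image, prompt_ids, relevance, alpha: float = 1.0):
    """Combine generative logits from the original image with discriminative
    logits from the augmented view; `alpha` is a hypothetical mixing weight.

    `model(image, prompt_ids)` stands in for an LVLM forward pass that
    returns next-token logits.
    """
    logits_global = model(image, prompt_ids)
    logits_local = model(prompt_relevant_mask(image, relevance), prompt_ids)
    # Calibrated distribution: global features drive generation while
    # local features sharpen visual discrimination.
    return logits_global + alpha * logits_local
```

Under these assumptions, decoding simply applies softmax and sampling to the assembled logits at each step, so the scheme remains training-free and plug-and-play with respect to the underlying LVLM.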