Despite their great success across various multimodal tasks, Large Vision-Language Models (LVLMs) suffer from a prevalent problem of object hallucination, where generated textual responses are inconsistent with the ground-truth objects in the given image. This paper investigates various LVLMs and pinpoints attention deficiency toward discriminative local image features as one root cause of object hallucinations. Specifically, LVLMs predominantly attend to prompt-independent global image features while failing to capture prompt-relevant local features, which undermines their visual grounding capacity and leads to hallucinations. To address this, we propose Assembly of Global and Local Attention (AGLA), a training-free, plug-and-play approach that mitigates object hallucinations by simultaneously exploiting an ensemble of global features for response generation and local features for visual discrimination. Our approach employs an image-prompt matching scheme that captures prompt-relevant local features from images, producing an augmented view of the input image in which prompt-relevant content is retained while irrelevant distractions are masked. With this augmented view, a calibrated decoding distribution can be derived by integrating generative global features from the original image and discriminative local features from the augmented image. Extensive experiments show that AGLA consistently mitigates object hallucinations and enhances the general perception capability of LVLMs across various discriminative and generative benchmarks. Our code will be released at https://github.com/Lackel/AGLA.