Existing Large Vision-Language Models (LVLMs) exhibit insufficient visual attention, leading to hallucinations. To alleviate this problem, some previous studies adjust and amplify visual attention. However, these methods share a limitation: uniformly boosting attention over all visual tokens inevitably increases attention to task-irrelevant tokens. To address this challenge, we propose a training-free attention intervention algorithm that enhances the attention of task-relevant tokens, based on the observation that task-relevant tokens generally exhibit high visual-textual similarity. Specifically, we extract the vision-text cross-attention submatrices, which capture visual-textual correlations, and use them to construct reweighting matrices that reallocate attention. Furthermore, to strengthen the contribution of visual tokens, we inject visual attention values into beam search decoding to identify candidates with higher visual attention. Extensive experiments demonstrate that our method significantly reduces hallucinations across mainstream LVLMs while preserving the accuracy and coherence of the generated content.
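The sketch below illustrates the two ideas described above: reweighting attention toward visual tokens in proportion to their visual-textual similarity, and biasing beam scores by the visual attention mass. It is a minimal, hypothetical illustration; the function names, the averaging over heads and text tokens, and the hyperparameters `alpha` and `beta` are assumptions, not the paper's exact formulation.

```python
import torch

def reweight_visual_attention(attn, vis_idx, txt_idx, alpha=1.0):
    """Hypothetical sketch of similarity-based attention reweighting.

    attn    : [heads, seq, seq] attention matrix from one decoder layer
    vis_idx : indices of visual tokens in the sequence
    txt_idx : indices of text tokens in the sequence
    alpha   : assumed intervention strength
    """
    # Vision-text cross-attention submatrix: how strongly each text token
    # attends to each visual token (proxy for visual-textual similarity).
    cross = attn[:, txt_idx][:, :, vis_idx]           # [heads, |txt|, |vis|]

    # Per-visual-token relevance, averaged over heads and text tokens (assumed).
    relevance = cross.mean(dim=(0, 1))                # [|vis|]
    weights = 1.0 + alpha * relevance / (relevance.max() + 1e-6)

    # Scale only the visual-token columns, then renormalize each row
    # so it remains a valid attention distribution.
    attn = attn.clone()
    attn[:, :, vis_idx] = attn[:, :, vis_idx] * weights
    return attn / attn.sum(dim=-1, keepdim=True)

def visual_aware_beam_score(log_prob, visual_attn_mass, beta=0.1):
    # Assumed scoring adjustment: favor beam candidates whose generation
    # places more total attention on the visual tokens.
    return log_prob + beta * visual_attn_mass
```

Under these assumptions, attention is reallocated only among visual tokens (task-relevant ones gain mass while row sums are preserved), and beams that ground their continuations in the image receive a small score bonus rather than a hard constraint.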