Retrieval-Augmented Generation (RAG) is one of the dominant paradigms for enhancing Large Vision-Language Models (LVLMs) on knowledge-based visual question answering (VQA) tasks. Recent work attributes RAG failures to insufficient attention to the retrieved context and proposes reducing the attention allocated to image tokens. In this work, we identify a distinct failure mode that previous studies overlooked: Attention Distraction (AD). When the retrieved context is sufficient (highly relevant or containing the correct answer), the retrieved text suppresses visual attention globally, and the attention on image tokens shifts away from question-relevant regions. This leads to failures on questions the model could answer correctly without the retrieved text. To mitigate this issue, we propose MAD-RAG, a training-free intervention that decouples visual grounding from context integration through a dual-question formulation, combined with attention mixing to preserve image-conditioned evidence. Extensive experiments on OK-VQA, E-VQA, and InfoSeek demonstrate that MAD-RAG consistently outperforms existing baselines across different model families, yielding absolute gains of up to 4.76%, 9.20%, and 6.18%, respectively, over the vanilla RAG baseline. Notably, MAD-RAG rectifies up to 74.68% of failure cases with negligible computational overhead.
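The global suppression effect described in the abstract can be illustrated with a toy calculation. The sketch below is not the MAD-RAG implementation: the abstract does not specify the mixing rule, so the convex combination, the `alpha` weight, the token layout, and the scores are all assumptions chosen for illustration. It shows only that adding high-scoring context tokens to a softmax reduces the attention mass on image tokens, and that mixing in a context-free attention distribution restores part of it. It does not model the shift of attention away from question-relevant regions.

```python
import math

def masked_softmax(scores, active):
    """Softmax over the active positions only; inactive positions get 0."""
    m = max(s for s, a in zip(scores, active) if a)
    exps = [math.exp(s - m) if a else 0.0 for s, a in zip(scores, active)]
    total = sum(exps)
    return [e / total for e in exps]

def image_attention_mass(attn, is_image):
    """Total attention assigned to image-token positions."""
    return sum(a for a, img in zip(attn, is_image) if img)

def mix_attention(attn_rag, attn_visual, alpha):
    """Hypothetical mixing rule: a convex combination of two distributions."""
    if len(attn_rag) != len(attn_visual):
        raise ValueError("attention vectors must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * r + alpha * v for r, v in zip(attn_rag, attn_visual)]

# Toy layout (assumed): 4 image tokens, 3 retrieved-context tokens, 2 question tokens.
KINDS = ["img"] * 4 + ["ctx"] * 3 + ["q"] * 2
# Toy pre-softmax scores (assumed); context tokens score highest.
SCORES = [2.0, 0.5, 0.5, 0.5, 3.0, 3.0, 3.0, 1.0, 1.0]

is_image = [k == "img" for k in KINDS]

# Attention with retrieved context present (the RAG pass).
attn_rag = masked_softmax(SCORES, [True] * len(KINDS))
# Attention with context tokens masked out (the image-plus-question pass).
attn_visual = masked_softmax(SCORES, [k != "ctx" for k in KINDS])
# Mixed attention with an assumed weight of 0.5.
attn_mixed = mix_attention(attn_rag, attn_visual, alpha=0.5)

mass_rag = image_attention_mass(attn_rag, is_image)
mass_visual = image_attention_mass(attn_visual, is_image)
mass_mixed = image_attention_mass(attn_mixed, is_image)
```

In this toy setting, `mass_rag` is smaller than `mass_visual` because the context tokens enlarge the softmax denominator, and `mass_mixed` falls between the two. In a real LVLM the intervention would act on per-head, per-layer attention rather than on a single vector.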