Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, and are often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, for which the relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this gap by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low retrieval precision, noisy passages, and limited reasoning over the retrieved content. We propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters out irrelevant passages, so that the additional context passed to the model is of high quality. The model follows a multi-stage training strategy in which reinforcement learning enhances reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence.
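The retrieval side of this pipeline (coarse retrieval, then fine-grained retrieval, then critic filtering) can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything below is an assumption made for illustration: the names `Document`, `overlap`, `coarse_retrieve`, `fine_retrieve`, `critic_filter`, and `retrieve_context` are hypothetical, and a toy token-overlap score stands in for both the multimodal retrievers and the learned critic. The sketch does not cover the reinforcement-learning training stage.

```python
from dataclasses import dataclass


@dataclass
class Document:
    title: str
    passages: list[str]


def overlap(query: str, text: str) -> float:
    """Toy relevance score: fraction of query tokens that occur in the text.

    Assumption: a real system would use learned (multimodal) embeddings.
    """
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def coarse_retrieve(query: str, docs: list[Document], k: int) -> list[Document]:
    """Coarse stage: score whole documents and keep the top k."""
    def doc_score(d: Document) -> float:
        return overlap(query, d.title + " " + " ".join(d.passages))
    return sorted(docs, key=doc_score, reverse=True)[:k]


def fine_retrieve(query: str, docs: list[Document], k: int) -> list[str]:
    """Fine stage: score individual passages of the surviving documents."""
    passages = [p for d in docs for p in d.passages]
    return sorted(passages, key=lambda p: overlap(query, p), reverse=True)[:k]


def critic_filter(query: str, passages: list[str], threshold: float) -> list[str]:
    """Stand-in critic: discard passages scored below the threshold.

    Assumption: the critic described in the abstract is a trained model,
    not a fixed similarity threshold.
    """
    return [p for p in passages if overlap(query, p) >= threshold]


def retrieve_context(
    query: str,
    docs: list[Document],
    k_docs: int = 2,
    k_passages: int = 3,
    threshold: float = 0.5,
) -> list[str]:
    """Chain the three stages and return the passages kept as context."""
    candidates = coarse_retrieve(query, docs, k_docs)
    passages = fine_retrieve(query, candidates, k_passages)
    return critic_filter(query, passages, threshold)
```

In this sketch the coarse stage narrows the knowledge base cheaply, the fine stage ranks passages only within the surviving documents, and the critic removes passages that ranked highly but are still irrelevant, which is the noise problem the abstract attributes to prior retrieval-augmented approaches.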