Multi-modal Retrieval-Augmented Generation (RAG) has become a critical method for empowering LLMs with candidate visual documents. However, current methods treat the entire document as the basic retrieval unit, introducing substantial irrelevant visual content in two ways: 1) relevant documents often contain large regions unrelated to the query, diluting the focus on salient information; 2) retrieving multiple documents to increase recall further introduces redundant and irrelevant documents. These redundant contexts distract the model's attention and degrade performance. To address this challenge, we propose \modelname, a novel framework that shifts the retrieval paradigm from the document level to the region level. During training, we design a hybrid supervision strategy using both labeled and unlabeled data to pinpoint relevant patches. During inference, we propose a dynamic pipeline that intelligently groups salient patches into complete semantic regions. By delegating the identification of relevant regions to the retriever, \modelname enables the generator to focus solely on concise visual content relevant to the query, improving both efficiency and accuracy. Experiments on six benchmarks demonstrate that \modelname achieves state-of-the-art performance, improving retrieval accuracy by 10.02\% in R@1 on average and question answering accuracy by 3.56\%, while using only 71.42\% of the visual tokens required by prior methods. The code will be available at https://github.com/Aeryn666/RegionRAG.
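To make the inference-time grouping step concrete, the following is a minimal sketch of how salient patches on a 2D grid might be merged into contiguous regions. The abstract does not specify the grouping algorithm; the thresholding, 4-connected component labeling, and the `group_salient_patches` helper below are illustrative assumptions, not the paper's actual pipeline.

\begin{verbatim}
import numpy as np
from scipy import ndimage

def group_salient_patches(scores: np.ndarray, threshold: float = 0.5):
    """Group salient patches into contiguous regions (illustrative sketch).

    scores: (H, W) grid of per-patch query-relevance scores in [0, 1]
            (assumed; the paper's scoring model is not shown here).
    Returns bounding boxes (row0, col0, row1, col1), one per connected
    component of above-threshold patches.
    """
    mask = scores >= threshold                # keep only salient patches
    labeled, _ = ndimage.label(mask)          # 4-connected component labels
    regions = []
    for slc in ndimage.find_objects(labeled): # one (row, col) slice pair each
        if slc is not None:
            r, c = slc
            regions.append((r.start, c.start, r.stop, c.stop))
    return regions

# Example: a 4x6 patch grid with two salient areas.
scores = np.array([
    [0.9, 0.8, 0.1, 0.0, 0.0, 0.0],
    [0.7, 0.6, 0.2, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.9, 0.8],
    [0.0, 0.0, 0.0, 0.0, 0.7, 0.9],
])
print(group_salient_patches(scores))  # [(0, 0, 2, 2), (2, 4, 4, 6)]
\end{verbatim}

Only the patches inside the returned regions would then be passed to the generator, which is what yields the token savings reported above.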