Multi-modal Retrieval-Augmented Generation (RAG) has emerged as a highly effective paradigm for Knowledge-Based Visual Question Answering (KB-VQA). Despite recent advances, prevailing methods still rely primarily on images as the retrieval key and often overlook or misassign the role of Vision-Language Models (VLMs), leaving their potential underused. In this paper, we introduce WikiSeeker, a novel multi-modal RAG framework that addresses these gaps by introducing a multi-modal retriever and redefining the role of VLMs. Rather than using VLMs merely as answer generators, we cast them as two specialized agents: a Refiner and an Inspector. The Refiner uses the VLM to rewrite the textual query according to the input image, significantly improving the performance of the multi-modal retriever. The Inspector enables a decoupled generation strategy: when the retrieved context is reliable, it routes that context to a separate Large Language Model (LLM) for answer generation; when retrieval is unreliable, it falls back on the VLM's internal knowledge. Extensive experiments on EVQA, InfoSeek, and M2KR demonstrate that WikiSeeker achieves state-of-the-art performance, with substantial improvements in both retrieval accuracy and answer quality. Our code will be released at https://github.com/zhuyjan/WikiSeeker.
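The control flow described above (refine the query, retrieve, inspect, then route to either the LLM or the VLM) can be illustrated with a minimal Python sketch. This is not the WikiSeeker implementation: every name here (`Passage`, `answer_question`, and the five callables passed in) is hypothetical, and the per-passage reliability check is an assumption, since the abstract does not specify how the Inspector judges reliability.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Passage:
    """A retrieved knowledge-base entry (hypothetical structure)."""
    title: str
    text: str


def answer_question(
    image: Any,
    question: str,
    refine: Callable[[Any, str], str],                   # Refiner: VLM rewrites the query
    retrieve: Callable[[Any, str, int], List[Passage]],  # multi-modal retriever
    inspect: Callable[[Any, str, Passage], bool],        # Inspector: is this passage reliable?
    llm_answer: Callable[[str, List[Passage]], str],     # separate LLM, answers from context
    vlm_answer: Callable[[Any, str], str],               # VLM, answers from internal knowledge
    top_k: int = 5,
) -> str:
    """Sketch of a decoupled generation strategy with a retrieval fallback."""
    # Step 1: rewrite the textual query so it reflects what the image shows.
    refined_query = refine(image, question)

    # Step 2: retrieve candidates using both the image and the refined text.
    candidates = retrieve(image, refined_query, top_k)

    # Step 3: keep only passages the Inspector judges reliable
    # (assumption: reliability is decided passage by passage).
    reliable = [p for p in candidates if inspect(image, question, p)]

    # Step 4: route. Reliable context goes to the LLM; otherwise the VLM
    # answers from its own knowledge without retrieved context.
    if reliable:
        return llm_answer(question, reliable)
    return vlm_answer(image, question)
```

Passing the components in as callables keeps the sketch independent of any particular model or retriever; in practice each one would wrap a model call.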