Multi-modal Retrieval-Augmented Generation (RAG) has emerged as an effective paradigm for Knowledge-Based Visual Question Answering (KB-VQA). Despite recent advances, prevailing methods still rely primarily on images as the retrieval key and often overlook or misplace the role of Vision-Language Models (VLMs), thereby failing to exploit their full potential. In this paper, we introduce WikiSeeker, a novel multi-modal RAG framework that addresses these gaps by introducing a multi-modal retriever and redefining the role of VLMs. Rather than using VLMs merely as answer generators, we deploy them as two specialized agents: a Refiner and an Inspector. The Refiner uses the VLM to rewrite the textual query according to the input image, significantly improving the performance of the multi-modal retriever. The Inspector enables a decoupled generation strategy: it selectively routes reliable retrieved context to a separate LLM for answer generation and falls back on the VLM's internal knowledge when retrieval is unreliable. Extensive experiments on EVQA, InfoSeek, and M2KR demonstrate that WikiSeeker achieves state-of-the-art performance, with substantial improvements in both retrieval accuracy and answer quality. Our code will be released at https://github.com/zhuyjan/WikiSeeker.
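The control flow described above (refine the query, retrieve, inspect, then route to either the LLM or the VLM) can be illustrated with a minimal Python sketch. This is not the released WikiSeeker code: the function `answer_question`, the `Passage` type, and all five callables are hypothetical placeholders, and judging reliability passage by passage is an assumption made here for illustration only.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Passage:
    """A retrieved knowledge-base passage (hypothetical structure)."""
    text: str
    score: float


def answer_question(
    image: Any,
    question: str,
    *,
    refine: Callable[[Any, str], str],                  # Refiner: VLM rewrites the query using the image
    retrieve: Callable[[Any, str, int], List[Passage]],  # multi-modal retriever
    inspect: Callable[[Any, str, Passage], bool],        # Inspector: is this passage reliable?
    llm_answer: Callable[[str, List[Passage]], str],     # separate LLM answering from context
    vlm_answer: Callable[[Any, str], str],               # VLM answering from internal knowledge
    top_k: int = 5,
) -> str:
    # Step 1: the Refiner rewrites the textual query according to the image.
    refined_query = refine(image, question)

    # Step 2: the retriever uses both the image and the refined text.
    passages = retrieve(image, refined_query, top_k)

    # Step 3: the Inspector keeps only passages it judges reliable
    # (per-passage filtering is an assumption of this sketch).
    reliable = [p for p in passages if inspect(image, question, p)]

    # Step 4: decoupled generation. Reliable context goes to the LLM;
    # otherwise fall back on the VLM's internal knowledge.
    if reliable:
        return llm_answer(question, reliable)
    return vlm_answer(image, question)
```

Passing the components as callables keeps the sketch independent of any particular model or retrieval library; each one would wrap a real VLM, LLM, or index in practice.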