Equipping large language models with retrieval-augmented generation (RAG) is a burgeoning field that aims to enhance question-answering capabilities by leveraging external knowledge bases. Although the application of RAG to language-only models has been extensively explored, its adaptation to multimodal vision-language models remains nascent. Going beyond mere answer generation, the primary goal of multimodal RAG is to cultivate the models' ability to reason in response to relevant queries. To this end, we introduce a novel multimodal RAG framework named RMR (Retrieval Meets Reasoning). The RMR framework employs a bi-modal retrieval module to identify the most relevant question-answer pairs, which then serve as scaffolds for the multimodal reasoning process. This training-free approach not only encourages the model to engage deeply with the reasoning processes inherent in the retrieved content but also facilitates the generation of answers that are precise and richly interpretable. Surprisingly, using only the ScienceQA dataset, collected from elementary and high school science curricula, RMR significantly boosts the performance of various vision-language models across a range of benchmark datasets, including A-OKVQA, MMBench, and SEED. These outcomes highlight the substantial potential of our multimodal retrieval and reasoning mechanism to improve the reasoning capabilities of vision-language models.
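To make the retrieve-then-reason idea concrete, here is a minimal sketch of how retrieved question-answer pairs could serve as scaffolds in a prompt. It is not the RMR implementation: the abstract does not say how the two modalities are scored or combined, so the weighted sum of text and image cosine similarities (`alpha`), the `QAExample` record, the toy two-dimensional embeddings, and the prompt layout are all assumptions made for illustration. In practice the embeddings would come from some text and image encoder, which is left out here.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class QAExample:
    """Hypothetical record for one entry of the retrieval pool."""
    question: str
    answer: str
    rationale: str
    text_emb: List[float]   # assumed to be precomputed by a text encoder
    image_emb: List[float]  # assumed to be precomputed by an image encoder


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_text_emb, query_image_emb, pool, k=2, alpha=0.5):
    """Return the k pool entries with the highest combined similarity.

    Assumption: the bi-modal score is a weighted sum of the text and
    image similarities. The source does not specify this.
    """
    scored = []
    for ex in pool:
        score = (alpha * cosine(query_text_emb, ex.text_emb)
                 + (1.0 - alpha) * cosine(query_image_emb, ex.image_emb))
        scored.append((score, ex))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in scored[:k]]


def build_prompt(question: str, retrieved: List[QAExample]) -> str:
    """Place retrieved QA pairs, with their rationales, before the query."""
    parts = []
    for i, ex in enumerate(retrieved, start=1):
        parts.append(
            f"Example {i}\nQuestion: {ex.question}\n"
            f"Reasoning: {ex.rationale}\nAnswer: {ex.answer}"
        )
    parts.append(f"Question: {question}\nReasoning:")
    return "\n\n".join(parts)


# Toy pool with made-up embeddings (illustrative values only).
pool = [
    QAExample("Which material is magnetic?", "iron",
              "Iron is attracted to magnets.", [1.0, 0.0], [1.0, 0.0]),
    QAExample("Which animal is a mammal?", "dog",
              "Dogs have fur and feed milk to their young.", [0.0, 1.0], [0.0, 1.0]),
    QAExample("Which object sticks to a magnet?", "nail",
              "Steel nails contain iron.", [0.9, 0.1], [0.8, 0.2]),
]

top = retrieve([1.0, 0.0], [1.0, 0.0], pool, k=2)
prompt = build_prompt("Will a paper clip stick to a magnet?", top)
```

Because the scaffolds are supplied only through the prompt, nothing in this sketch updates model weights, which is consistent with the training-free setting the abstract describes. The resulting `prompt` string would be passed, together with the query image, to whichever vision-language model is being evaluated.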