Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in processing and generating content across multiple data modalities. However, a significant drawback of MLLMs is their reliance on static training data, which leads to outdated information and limited contextual awareness. This static nature hampers their ability to provide accurate and up-to-date responses, particularly in dynamic or rapidly evolving contexts. Although integrating Multimodal Retrieval-Augmented Generation (Multimodal RAG) offers a promising solution, such a system inevitably encounters the multi-granularity noisy correspondence (MNC) problem, which hinders accurate retrieval and generation. In this work, we propose RagVL, a novel framework with knowledge-enhanced reranking and noise-injected training, to address these limitations. We instruction-tune the MLLM with a simple yet effective instruction template to induce its ranking ability and deploy it as a reranker to precisely filter the top-k retrieved images. For generation, we inject visual noise during training at both the data and token levels to enhance the generator's robustness. Extensive experiments on subsets of two datasets that require retrieving and reasoning over images to answer a given query verify the effectiveness of our method. Code and models are available at https://github.com/IDEA-FinAI/RagVL.