Multimodal LLMs are the natural evolution of LLMs, extending their capabilities beyond the purely textual modality. While ongoing research focuses on designing novel architectures and vision-and-language adapters, in this paper we concentrate on endowing such models with the capability of answering questions that require external knowledge. Our approach, termed Wiki-LLaVA, aims to integrate an external knowledge source of multimodal documents, which is accessed through a hierarchical retrieval pipeline. With this approach, relevant passages are retrieved from the external knowledge source and used as additional context for the LLM, improving the effectiveness and precision of the generated dialogues. We conduct extensive experiments on datasets tailored for visual question answering with external data and demonstrate the appropriateness of our approach.
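The idea of a hierarchical retrieval pipeline that feeds passages to the LLM as extra context can be illustrated with a minimal sketch. The abstract does not specify the retrieval stages, the embedding models, or the prompt format, so everything here is an assumption made for illustration: a first stage that selects documents by similarity to an image embedding, a second stage that selects passages within those documents by similarity to a question embedding, hand-written toy vectors in place of real encoders, and the hypothetical helpers `hierarchical_retrieve` and `build_prompt`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, items, k):
    """Return the k items whose "vec" field is most similar to query_vec."""
    ranked = sorted(items, key=lambda it: cosine(query_vec, it["vec"]), reverse=True)
    return ranked[:k]


def hierarchical_retrieve(image_vec, question_vec, documents, k_docs=1, k_passages=2):
    """Two-stage retrieval (assumed design, not taken from the paper).

    Stage 1: pick the documents closest to the image embedding.
    Stage 2: pick the passages of those documents closest to the question embedding.
    """
    docs = top_k(image_vec, documents, k_docs)
    candidates = [p for d in docs for p in d["passages"]]
    return [p["text"] for p in top_k(question_vec, candidates, k_passages)]


def build_prompt(question, passages):
    """Prepend the retrieved passages to the question (hypothetical prompt format)."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Toy knowledge source; the vectors are hand-written stand-ins for real embeddings.
documents = [
    {
        "title": "Eiffel Tower",
        "vec": [1.0, 0.0],
        "passages": [
            {"text": "The Eiffel Tower is in Paris.", "vec": [0.9, 0.1]},
            {"text": "It was completed in 1889.", "vec": [0.1, 0.9]},
        ],
    },
    {
        "title": "Colosseum",
        "vec": [0.0, 1.0],
        "passages": [
            {"text": "The Colosseum is in Rome.", "vec": [0.9, 0.1]},
        ],
    },
]

if __name__ == "__main__":
    passages = hierarchical_retrieve([0.9, 0.1], [0.0, 1.0], documents, k_docs=1, k_passages=1)
    print(build_prompt("When was it completed?", passages))
```

In a real system the toy vectors would come from image and text encoders and the prompt would be passed to the multimodal LLM together with the image; the sketch only shows how narrowing the search from documents to passages keeps the added context short.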