Recent multimodal large language models (MLLMs) have achieved strong multimodal generation capabilities, comparable to those of GPT-4. These models predominantly map visual information into the language representation space, leveraging the vast knowledge and powerful text generation abilities of LLMs to produce multimodal instruction-following responses. We term this paradigm LLMs for Vision, since it employs LLMs for visual-language understanding. We observe, however, that these MLLMs neglect the potential of harnessing visual knowledge to enhance the overall capabilities of LLMs, a direction that can be regarded as Vision Enhancing LLMs. In this paper, we propose MKS2, an approach that enhances LLMs by enabling Multimodal Knowledge Storage and Sharing within them. Specifically, we introduce the Modular Visual Memory, a component integrated into the internal blocks of LLMs and designed to store open-world visual information efficiently. Additionally, we present a soft Mixtures-of-Multimodal Experts architecture in LLMs that invokes multimodal knowledge collaboration during generation. Our comprehensive experiments demonstrate that MKS2 substantially improves the reasoning capabilities of LLMs in contexts that require physical or commonsense knowledge. It also delivers competitive results on multimodal benchmarks.
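To make the idea of a soft mixture of multimodal experts concrete, the following is a minimal illustrative sketch, not the MKS2 implementation. The abstract does not specify the gating function, the expert internals, or where the mixture sits inside an LLM block, so everything below is an assumption: the names `soft_mixture`, `text_expert`, and `visual_memory_expert` are hypothetical, the experts are toy functions standing in for a textual feed-forward layer and a visual memory module, and the gate is a plain softmax over a linear projection of the hidden state.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights: Sequence[Sequence[float]], x: Sequence[float]) -> Vector:
    """Matrix-vector product: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def soft_mixture(
    hidden: Vector,
    experts: Sequence[Callable[[Vector], Vector]],
    gate_weights: Sequence[Sequence[float]],
) -> Tuple[Vector, Vector]:
    """Blend the outputs of all experts with softmax gate weights.

    "Soft" here means every expert contributes, weighted by the gate,
    rather than routing the input to a single selected expert.
    Assumes each expert returns a vector of the same length as `hidden`.
    """
    gate = softmax(linear(gate_weights, hidden))  # one weight per expert
    outputs = [expert(hidden) for expert in experts]
    mixed = [
        sum(g * out[i] for g, out in zip(gate, outputs))
        for i in range(len(hidden))
    ]
    return mixed, gate


# Hypothetical toy experts (assumptions, not the paper's modules).
def text_expert(h: Vector) -> Vector:
    return [2.0 * v for v in h]


def visual_memory_expert(h: Vector) -> Vector:
    return [v + 1.0 for v in h]


# With an all-zero gate projection, both experts receive equal weight.
mixed, gate = soft_mixture(
    [1.0, 2.0],
    [text_expert, visual_memory_expert],
    gate_weights=[[0.0, 0.0], [0.0, 0.0]],
)
```

In this sketch the gate weights would be learned, so the model could lean on the visual expert for inputs that call for physical or commonsense knowledge and on the textual expert otherwise. That behavior is an interpretation of the abstract rather than a documented detail.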