Large language models (LLMs) have demonstrated impressive zero-shot abilities on a variety of open-ended tasks, and recent research has also explored the use of LLMs for multi-modal generation. In this study, we introduce mPLUG-Owl, a novel training paradigm that equips LLMs with multi-modal abilities through modularized learning of a foundation LLM, a visual knowledge module, and a visual abstractor module. This approach can support multiple modalities and facilitates diverse unimodal and multi-modal abilities through modality collaboration. The training paradigm of mPLUG-Owl uses a two-stage method for aligning image and text, which learns visual knowledge with the assistance of the LLM while maintaining, and even improving, the LLM's generation abilities. In the first stage, the visual knowledge module and the abstractor module are trained with a frozen LLM module to align image and text. In the second stage, language-only and multi-modal supervised datasets are used to jointly fine-tune a low-rank adaptation (LoRA) module on the LLM and the abstractor module, while the visual knowledge module is kept frozen. We also carefully build OwlEval, a visually related instruction evaluation set. Experimental results show that our model outperforms existing multi-modal models, demonstrating mPLUG-Owl's impressive instruction and visual understanding ability, multi-turn conversation ability, and knowledge reasoning ability. In addition, we observe some unexpected and exciting abilities, such as multi-image correlation and scene text understanding, which make it possible to apply the model to harder real-world scenarios such as vision-only document comprehension. Our code, pre-trained model, instruction-tuned models, and evaluation set are available at https://github.com/X-PLUG/mPLUG-Owl. The online demo is available at https://www.modelscope.cn/studios/damo/mPLUG-Owl.
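The two-stage schedule described above can be summarized as a small freeze/train plan. The following is only an illustrative sketch in plain Python, not the released implementation: the module names (`visual_knowledge`, `visual_abstractor`, `llm`, `llm_lora`), the `StagePlan` class, and the `make_stage` helper are hypothetical labels introduced here, and the stage-1 data label is an assumption, since the abstract states only that this stage aligns image and text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StagePlan:
    """Which modules are updated and which are frozen in one training stage."""
    name: str
    trainable: frozenset
    frozen: frozenset
    data: tuple


def make_stage(name, modules, trainable, data):
    """Build a plan; every module that is not listed as trainable is frozen."""
    modules = frozenset(modules)
    trainable = frozenset(trainable)
    unknown = trainable - modules
    if unknown:
        raise ValueError(f"unknown modules: {sorted(unknown)}")
    return StagePlan(name, trainable, modules - trainable, tuple(data))


# Stage 1: train the visual knowledge module and the abstractor, LLM frozen.
STAGE_1 = make_stage(
    name="stage1_alignment",
    modules={"visual_knowledge", "visual_abstractor", "llm"},
    trainable={"visual_knowledge", "visual_abstractor"},
    data=["image_text_data"],  # assumed label; the abstract does not name it
)

# Stage 2: jointly fine-tune a LoRA module on the LLM and the abstractor,
# with the visual knowledge module frozen. The base LLM weights are listed
# as frozen because LoRA updates only the added low-rank parameters.
STAGE_2 = make_stage(
    name="stage2_joint_finetune",
    modules={"visual_knowledge", "visual_abstractor", "llm", "llm_lora"},
    trainable={"llm_lora", "visual_abstractor"},
    data=["language_only_supervised", "multimodal_supervised"],
)


def is_trainable(stage, module):
    """Return True if `module` receives gradient updates in `stage`."""
    return module in stage.trainable


if __name__ == "__main__":
    for stage in (STAGE_1, STAGE_2):
        print(stage.name, "train:", sorted(stage.trainable),
              "freeze:", sorted(stage.frozen))
```

In a real training loop, a plan like this would typically be applied by toggling gradient tracking on the parameters of each named module before building the optimizer; that step is omitted here because it depends on the framework and on the actual module layout.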