Large language models (LLMs) have demonstrated impressive zero-shot abilities on a variety of open-ended tasks, and recent research has also explored using LLMs for multi-modal generation. In this study, we introduce mPLUG-Owl, a novel training paradigm that equips LLMs with multi-modal abilities through modularized learning of a foundation LLM, a visual knowledge module, and a visual abstractor module. This approach can support multiple modalities and facilitates diverse unimodal and multi-modal abilities through modality collaboration. The training paradigm of mPLUG-Owl uses a two-stage method for aligning image and text, which learns visual knowledge with the assistance of the LLM while maintaining and even improving the LLM's generation abilities. In the first stage, the visual knowledge module and the abstractor module are trained with a frozen LLM module to align image and text. In the second stage, language-only and multi-modal supervised datasets are used to jointly fine-tune a low-rank adaptation (LoRA) module on the LLM and the abstractor module while the visual knowledge module is kept frozen. We carefully build OwlEval, a visually related instruction evaluation set. Experimental results show that our model outperforms existing multi-modal models, demonstrating mPLUG-Owl's impressive instruction and visual understanding ability, multi-turn conversation ability, and knowledge reasoning ability. In addition, we observe some unexpected and exciting abilities, such as multi-image correlation and scene text understanding, which make it possible to apply the model to harder real-world scenarios such as vision-only document comprehension. Our code, pre-trained model, instruction-tuned models, and evaluation set are available at https://github.com/X-PLUG/mPLUG-Owl. The online demo is available at https://www.modelscope.cn/studios/damo/mPLUG-Owl.
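The two-stage schedule described in the abstract can be summarized as a small lookup of which modules are updated in each stage. The following is a minimal sketch, not the released implementation: the module keys and the `trainable_modules` helper are hypothetical names chosen for illustration, and the sketch assumes that the base LLM weights stay frozen in the second stage (as is usual when only a LoRA module is fine-tuned) and that the LoRA module is not trained in the first stage.

```python
# Hypothetical module keys; the names mirror the abstract, not the released code.
MODULES = ("visual_knowledge", "visual_abstractor", "llm", "llm_lora")


def trainable_modules(stage: int) -> dict[str, bool]:
    """Return, for each module, whether it is updated in the given stage."""
    if stage == 1:
        # Stage 1: train the visual knowledge and abstractor modules
        # against a frozen LLM to align image and text.
        trainable = {"visual_knowledge", "visual_abstractor"}
    elif stage == 2:
        # Stage 2: freeze the visual knowledge module and jointly fine-tune
        # the LoRA module on the LLM together with the abstractor module.
        # Assumption: the base LLM weights themselves remain frozen.
        trainable = {"visual_abstractor", "llm_lora"}
    else:
        raise ValueError(f"unknown stage: {stage}")
    return {name: name in trainable for name in MODULES}


if __name__ == "__main__":
    for stage in (1, 2):
        print(f"stage {stage}: {trainable_modules(stage)}")
```

In a real training loop, a table like this would typically drive which parameters have gradients enabled before the optimizer is built; that wiring depends on the framework and is left out here.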