Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in visual understanding and reasoning tasks. However, their high computational demands during both training and inference restrict their use to a limited audience within the research and user communities, hindering broader adoption. In this paper, we investigate the design space of Multimodal Small Language Models (MSLMs) and propose an efficient multimodal assistant named Mipha, which is designed to create synergy across three aspects: visual representation, language models, and optimization strategies. We show that, without increasing the volume of training data, our Mipha-3B outperforms state-of-the-art large MLLMs, notably LLaVA-1.5-13B, on multiple benchmarks. Through a detailed discussion, we provide insights and guidelines for developing strong MSLMs that rival the capabilities of MLLMs. Our code is available at https://github.com/zhuyiche/llava-phi.