Multimodal Large Language Models (MLLMs) have shown impressive capabilities in visual understanding and reasoning tasks. However, their high computational cost during both training and inference hinders widespread adoption and restricts their use to a small portion of the research and user communities. In this paper, we investigate the design aspects of Multimodal Small Language Models (MSLMs) and propose Mipha, an efficient multimodal assistant designed to create synergy among visual representation, language models, and optimization strategies. We show that, without increasing the volume of training data, our Mipha-3B outperforms state-of-the-art large MLLMs, notably LLaVA-1.5-13B, on multiple benchmarks. Through detailed discussion, we provide insights and guidelines for developing strong MSLMs that rival the capabilities of MLLMs. Our code is available at https://github.com/zhuyiche/Mipha.