Multimodal Large Language Models (MLLMs) have shown impressive abilities in visual understanding and reasoning tasks. Yet their widespread application is hindered by high computational demands during both training and inference, which restricts their use to a small audience within the research and user communities. In this paper, we investigate the design aspects of Multimodal Small Language Models (MSLMs) and propose an efficient multimodal assistant named Mipha, which is designed to create synergy among three aspects: visual representation, language models, and optimization strategies. We show that, without increasing the volume of training data, our Mipha-3B outperforms state-of-the-art large MLLMs, notably LLaVA-1.5-13B, on multiple benchmarks. Through detailed discussion, we provide insights and guidelines for developing strong MSLMs that rival the capabilities of MLLMs. Our code is available at https://github.com/zhuyiche/Mipha.