We present MobileVLM, a competent multimodal vision language model (MMVLM) designed to run on mobile devices. It combines a range of mobile-oriented architectural designs and techniques, comprising a set of language models at the scale of 1.4B and 2.7B parameters trained from scratch, a multimodal vision model pre-trained in the CLIP fashion, and cross-modality interaction via an efficient projector. We evaluate MobileVLM on several typical VLM benchmarks. Our models perform on par with a few much larger models. More importantly, we measure the inference speed on both a Qualcomm Snapdragon 888 CPU and an NVIDIA Jetson Orin GPU, and we obtain state-of-the-art performance of 21.5 tokens and 65.3 tokens per second, respectively. Our code will be made available at: https://github.com/Meituan-AutoML/MobileVLM.
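The inference speeds above are reported in tokens per second. As a minimal sketch of how such a figure can be computed, assuming throughput is simply the number of generated tokens divided by wall-clock decoding time, the helper below times a single generation call. The names `tokens_per_second`, `measure_throughput`, and the `generate` callable are hypothetical and for illustration only; they are not taken from the MobileVLM code base, and the actual measurement protocol may differ.

```python
import time
from typing import Callable, Sequence


def tokens_per_second(num_tokens: int, elapsed_seconds: float) -> float:
    """Throughput as generated tokens divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_tokens / elapsed_seconds


def measure_throughput(
    generate: Callable[[], Sequence[int]],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Time one call to `generate` and return tokens per second.

    `generate` is a hypothetical stand-in for a model's decoding routine;
    it is assumed to return the sequence of generated token ids.
    """
    start = clock()
    tokens = generate()
    elapsed = clock() - start
    return tokens_per_second(len(tokens), elapsed)
```

In practice one would typically average over several runs and discard warm-up iterations; this sketch omits both for brevity.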