We introduce Xmodel-VLM, a cutting-edge multimodal vision language model designed for efficient deployment on consumer GPU servers. Our work directly addresses a pivotal industry concern: the prohibitive service costs that hinder the broad adoption of large-scale multimodal systems. Through rigorous training, we develop a 1B-scale language model from scratch, employing the LLaVA paradigm for modal alignment. The result, which we call Xmodel-VLM, is a lightweight yet powerful multimodal vision language model. Extensive testing across numerous classic multimodal benchmarks shows that, despite its smaller size and faster execution, Xmodel-VLM delivers performance comparable to that of larger models. Our model checkpoints and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelVLM.