Multimodal Large Language Models (MLLMs) have demonstrated notable capabilities in general visual understanding and reasoning tasks. However, their deployment is hindered by substantial computational costs during both training and inference, limiting accessibility to the broader research and user communities. A straightforward remedy is to build on smaller pre-trained vision and language models, but this inevitably causes a significant performance drop. In this paper, we demonstrate the possibility of training a smaller but better MLLM with high-quality training data. Specifically, we introduce Bunny, a family of lightweight MLLMs with flexible vision and language backbones for efficient multimodal learning from selected training data. Experiments show that our Bunny-4B/8B outperforms state-of-the-art large MLLMs on multiple benchmarks. We expect this work to provide the community with a clean and flexible open-source tool for further research and development. The code, models, and data are available at https://github.com/BAAI-DCAI/Bunny.