In recent years, multimodal large language models (MLLMs) such as GPT-4V have advanced rapidly and now excel at a variety of vision-language tasks. However, their closed-source nature and computational demands limit their accessibility and applicability. This study introduces TinyGPT-V, an open-source MLLM designed for efficient training and inference across vision-language tasks, including image captioning (IC) and visual question answering (VQA). TinyGPT-V integrates the Phi-2 language model with pre-trained vision encoders, using a mapping module to fuse visual and linguistic information. With a training regimen optimized for small backbones and a diverse mix of datasets, TinyGPT-V requires significantly fewer computational resources (24GB for training and as little as 8GB for inference) without compromising performance. Our experiments demonstrate that TinyGPT-V, whose language model has 2.8 billion parameters, achieves results comparable to those of its larger counterparts in VQA and image inference tasks, while quantization makes it well suited for deployment on resource-constrained devices. This work paves the way for more accessible and efficient MLLMs and underscores the potential of smaller, optimized models to bridge the gap between high performance and computational efficiency in real-world applications. It also introduces a new approach to building MLLMs on smaller backbones. Our code and training weights are available in the supplementary material.
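As a rough sanity check on the figures quoted above, the following sketch estimates how much memory the language-model weights alone would occupy at different numeric precisions. It is back-of-the-envelope arithmetic, not the method used in this work: the function name and the choice of precisions are illustrative assumptions, and the estimate ignores the vision encoders, the mapping module, activations, and any caches.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PHI2_PARAMS = 2_800_000_000  # language-model size quoted in the abstract
INFERENCE_BUDGET_GB = 8      # inference budget quoted in the abstract

# The precisions below are illustrative assumptions, not settings from the paper.
for bits in (32, 16, 8):
    gb = weight_memory_gb(PHI2_PARAMS, bits)
    fits = gb < INFERENCE_BUDGET_GB
    print(f"{bits:>2}-bit weights: {gb:.1f} GB, under the 8GB budget: {fits}")
```

Under these assumptions, 32-bit weights for the language model alone would exceed the 8GB budget, while 16-bit or 8-bit weights would leave headroom for the remaining components, which is consistent with the emphasis on quantization for resource-constrained devices.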