In the era of advanced multimodal learning, multimodal large language models (MLLMs) such as GPT-4V have made remarkable strides towards bridging language and visual elements. However, their closed-source nature and considerable computational demands present notable challenges for widespread use and modification. Open-source MLLMs such as LLaVA and MiniGPT-4 fill this gap, achieving groundbreaking results across tasks. Computational efficiency nonetheless remains unresolved, because models such as LLaVA-v1.5-13B still require substantial resources. To address these issues, we introduce TinyGPT-V, a model that combines impressive performance with commonplace computational capacity. It requires only a 24G GPU for training and an 8G GPU or CPU for inference. Built upon Phi-2, TinyGPT-V couples an effective language backbone with pre-trained vision modules from BLIP-2 or CLIP. TinyGPT-V's 2.8B parameters can undergo a unique quantisation process, making the model suitable for local deployment and inference on various devices with 8G of memory. Our work fosters further development of cost-effective, efficient, and high-performing MLLMs, expanding their applicability to a broad array of real-world scenarios. Furthermore, this paper proposes a new paradigm for multimodal large language models built on small backbones. Our code and training weights are available at https://github.com/DLYuanGod/TinyGPT-V and https://huggingface.co/Tyrannosaurus/TinyGPT-V, respectively.
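To make the 8G figure concrete, the following back-of-the-envelope sketch estimates the memory needed for the weights alone at several precisions. The bit widths are illustrative assumptions, since the abstract does not state which precision the quantisation process uses, and the helper `weight_memory_gb` is a hypothetical name introduced here. Activations, the key-value cache, and framework overhead are ignored, so real usage would be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 2.8e9  # parameter count stated in the abstract

# Bit widths are assumptions for illustration; the abstract does not specify them.
footprints = {bits: weight_memory_gb(NUM_PARAMS, bits) for bits in (32, 16, 8, 4)}

for bits, gb in footprints.items():
    print(f"{bits:>2}-bit weights: {gb:.1f} GB")
```

Under these assumptions, 16-bit weights occupy roughly 5.6 GB and 8-bit weights roughly 2.8 GB, which is consistent with the claim that a quantised model can run on a device with 8G of memory.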