Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models

Interest has recently grown in extending large language models (LLMs) with multimodal capabilities such as vision-language (VL) learning, which is regarded as the next milestone on the path to artificial general intelligence. However, existing solutions are prohibitively expensive: they not only optimize an excessive number of parameters, but also require another round of large-scale pre-training before VL instruction tuning. In this paper, we propose a novel and affordable solution for the effective VL adaptation of LLMs, called Mixture-of-Modality Adaptation (MMA). Instead of using large neural networks to connect the image encoder and the LLM, MMA adopts lightweight modules, i.e., adapters, to bridge the gap between LLMs and VL tasks, which also enables the joint optimization of the image and language models. MMA is also equipped with a routing algorithm that helps LLMs shift automatically between single- and multi-modal instructions without compromising their natural language understanding ability. We apply MMA to a recent LLM, LLaMA, and term the resulting large vision-language instructed model LaVIN. To validate MMA and LaVIN, we conduct extensive experiments under two setups, namely multimodal science question answering and multimodal dialogue. The experimental results not only demonstrate that LaVIN achieves competitive performance with superior training efficiency compared to existing multimodal LLMs, but also confirm its great potential as a general-purpose chatbot. More importantly, the actual expenditure of LaVIN is extremely low, e.g., only 1.4 training hours with 3.8M trainable parameters, which confirms the effectiveness of MMA. Our project is released at https://luogen1996.github.io/lavin.
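To make the idea of lightweight adapters combined with routing more concrete, the following is a minimal sketch in plain Python. The abstract does not specify how the adapters or the routing algorithm are built, so everything here is an assumption made for illustration: the bottleneck (down-projection, ReLU, up-projection) adapter design, the softmax router over two adapter paths, the residual connection, and the names `Adapter`, `RoutedAdapter`, and `route` are all hypothetical and are not taken from the LaVIN code.

```python
import math
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; stands in for trained weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class Adapter:
    """Assumed bottleneck adapter: down-project, ReLU, up-project."""

    def __init__(self, dim, bottleneck, rng):
        self.down = make_matrix(bottleneck, dim, rng)
        self.up = make_matrix(dim, bottleneck, rng)

    def __call__(self, x):
        hidden = [max(0.0, v) for v in matvec(self.down, x)]
        return matvec(self.up, hidden)


class RoutedAdapter:
    """Hypothetical layer: two adapter paths mixed by input-dependent weights."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        self.paths = [Adapter(dim, bottleneck, rng), Adapter(dim, bottleneck, rng)]
        self.router = make_matrix(2, dim, rng)  # one logit per adapter path

    def route(self, x):
        """Return one mixing weight per path; the weights sum to 1."""
        return softmax(matvec(self.router, x))

    def __call__(self, x):
        weights = self.route(x)
        outputs = [path(x) for path in self.paths]
        # Residual connection: the frozen model's feature plus the mixed update.
        return [
            xi + weights[0] * a + weights[1] * b
            for xi, a, b in zip(x, outputs[0], outputs[1])
        ]


layer = RoutedAdapter(dim=8, bottleneck=2)
features = [0.5] * 8
mixed = layer(features)
print(len(mixed), layer.route(features))
```

In this sketch only the two small adapters and the router would be trained while the backbone stays frozen, which is one common way to keep the trainable parameter count low; whether LaVIN does exactly this cannot be confirmed from the abstract alone.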
