The integration of large language models (LLMs) with vision-language (VL) tasks has been a transformative development in artificial intelligence, highlighting the potential of LLMs as versatile general-purpose chatbots. However, the current trend in this evolution focuses on integrating vision and language to create models that can operate in more diverse, real-world contexts. We present a novel approach, termed Bottleneck Adapter, specifically crafted to enhance the multimodal functionalities of these complex models, enabling joint optimization of the entire multimodal LLM framework through a process known as Multimodal Model Tuning (MMT). Our approach uses lightweight adapters to connect the image encoder and the LLM without the need for large, complex neural networks. Unlike conventional modular training schemes, our approach adopts an end-to-end optimization regime which, combined with the adapters, enables joint optimization over a significantly smaller parameter set. Our method exhibits robust performance, achieving 90.12\% accuracy and outperforming both human-level performance (88.4\%) and LaVIN-7B (89.41\%).
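To make the adapter idea concrete, the following is a minimal sketch of a bottleneck adapter module in PyTorch, assuming the standard down-projection, nonlinearity, up-projection design with a residual connection; the hidden and bottleneck dimensions, activation choice, and insertion points are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal bottleneck adapter sketch (assumed design, not the paper's exact spec).
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        # Project down to a small bottleneck, apply a nonlinearity, project back up.
        self.down_proj = nn.Linear(hidden_dim, bottleneck_dim)
        self.activation = nn.GELU()
        self.up_proj = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection preserves the frozen backbone's features while the
        # low-dimensional bottleneck learns the multimodal adjustment.
        return x + self.up_proj(self.activation(self.down_proj(x)))

# Illustrative usage: during MMT-style tuning, the image encoder and LLM weights
# would stay frozen and only adapter parameters like these are updated end to end.
hidden_states = torch.randn(1, 16, 4096)          # e.g., LLM hidden states
adapter = BottleneckAdapter(hidden_dim=4096)
out = adapter(hidden_states)                       # same shape as the input
```

Because only the down- and up-projection matrices are trained, the tunable parameter count scales with the bottleneck width rather than the backbone size, which is what allows joint end-to-end optimization over a significantly smaller parameter set.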