Multimodal large language models (MLLMs) have attracted widespread interest and support a rich range of applications. However, the attention mechanism inherent in their Transformer architecture has computational complexity quadratic in sequence length, which results in expensive computational overhead. In this work, we therefore propose VL-Mamba, a multimodal large language model based on state space models, which have shown great potential for long-sequence modeling with fast inference and linear scaling in sequence length. Specifically, we first replace the Transformer-based backbone language model, such as LLaMA or Vicuna, with a pre-trained Mamba language model. We then empirically explore how to effectively apply the 2D vision selective scan mechanism to multimodal learning, as well as combinations of different vision encoders and variants of pre-trained Mamba language models. Extensive experiments on diverse multimodal benchmarks show that VL-Mamba achieves competitive performance, demonstrating its effectiveness and the great potential of state space models for multimodal learning tasks.
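To make the linear-scaling claim concrete, the following is a minimal sketch of a discrete state space recurrence, not the VL-Mamba implementation. It assumes a diagonal, time-invariant state matrix; the names `ssm_scan`, `A_bar`, `B_bar`, and `C` are illustrative choices and do not come from the paper. Mamba's selective scan additionally makes these parameters depend on the input, which this sketch omits. The point is only that each step updates a fixed-size state, so the cost grows linearly with sequence length rather than quadratically as in full attention.

```python
import numpy as np


def ssm_scan(A_bar, B_bar, C, x):
    """Run a discrete linear state space recurrence over a 1D input sequence.

    Assumed (illustrative) formulation with a diagonal state matrix:
        h_t = A_bar * h_{t-1} + B_bar * x_t
        y_t = C . h_t
    A_bar, B_bar, C are 1D arrays of the same length (the state size).
    """
    h = np.zeros_like(A_bar, dtype=float)  # fixed-size hidden state
    y = np.empty(len(x), dtype=float)
    for t, x_t in enumerate(x):
        # One constant-cost update per token: total cost is O(len(x)).
        h = A_bar * h + B_bar * x_t
        y[t] = C @ h
    return y


if __name__ == "__main__":
    # Impulse input with a single decaying state channel.
    out = ssm_scan(
        A_bar=np.array([0.5]),
        B_bar=np.array([1.0]),
        C=np.array([1.0]),
        x=np.array([1.0, 0.0, 0.0]),
    )
    print(out)
```

Because the state `h` has a fixed size, memory use during generation does not grow with the number of tokens already processed, which is the property the abstract refers to as fast inference.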