In recent years, multimodal large language models (MLLMs) have achieved remarkable success across various fields. However, as the foundation models for many downstream tasks, current MLLMs are built on the well-known Transformer network, whose computational complexity is quadratic in sequence length and therefore less efficient. To improve the efficiency of such foundation models, we propose Cobra, an MLLM with linear computational complexity. Specifically, Cobra integrates the efficient Mamba language model with the visual modality. Moreover, we explore and study various modal fusion schemes to create an effective multimodal Mamba. Extensive experiments demonstrate that (1) Cobra achieves extremely competitive performance with current computationally efficient state-of-the-art methods, e.g., LLaVA-Phi, TinyLLaVA, and MobileVLM v2, and is faster owing to its linear sequential modeling; (2) interestingly, results on challenging closed-set prediction benchmarks show that Cobra performs well at overcoming visual illusions and judging spatial relationships; and (3) notably, Cobra even achieves performance comparable to LLaVA with about 43% of the number of parameters. We will make all code for Cobra open-source and hope that the proposed method can facilitate future research on complexity problems in MLLMs. Our project page is available at: https://sites.google.com/view/cobravlm.