The rapid development of large language and vision models (LLVMs) has been driven by advances in visual instruction tuning. Recently, open-source LLVMs have curated high-quality visual instruction tuning datasets and utilized additional vision encoders or multiple computer vision models to narrow the performance gap with powerful closed-source LLVMs. These advancements are attributed to the multifaceted information required for diverse capabilities, including fundamental image understanding, real-world knowledge of common sense and non-object concepts (e.g., charts, diagrams, symbols, signs, and math problems), and step-by-step procedures for solving complex questions. Drawing on this multifaceted information, we present a new efficient LLVM, Mamba-based traversal of rationales (Meteor), which leverages multifaceted rationales to enhance understanding and answering capabilities. To embed lengthy rationales containing abundant information, we employ the Mamba architecture, which processes sequential data with linear time complexity. We introduce a new concept, traversal of rationale, which facilitates efficient embedding of rationales. Subsequently, the backbone multimodal language model (MLM) is trained to generate answers with the aid of rationales. Through these steps, Meteor achieves significant improvements in vision-language performance across multiple evaluation benchmarks requiring diverse capabilities, without scaling up the model size or employing additional vision encoders and computer vision models.
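The two-stage idea above (compress a long rationale in linear time, then answer conditioned on the compressed state) can be sketched with a toy example. This is a minimal illustration, not the paper's method: the scalar linear recurrence stands in for a Mamba-style selective state-space scan, and all names (`ssm_scan`, `embed_rationale`, `answer_with_rationale`, `decay`) are hypothetical.

```python
# Hedged sketch of the Meteor-style pipeline: a linear-time recurrence
# folds a long rationale into a compact state, and a stubbed backbone
# model then answers conditioned on that state instead of the full text.
# All function names and parameters here are illustrative assumptions.

def ssm_scan(inputs, decay=0.9):
    """Fold a sequence into one state via h_t = decay * h_{t-1} + x_t.
    A single pass: O(len(inputs)) time, unlike quadratic attention."""
    h = 0.0
    for x in inputs:
        h = decay * h + x
    return h

def embed_rationale(rationale_tokens):
    # Stand-in "embedding": map each token to a scalar feature,
    # then compress the whole rationale with the linear-time scan.
    feats = [len(tok) / 10.0 for tok in rationale_tokens]
    return ssm_scan(feats)

def answer_with_rationale(question, rationale_tokens):
    # The backbone MLM (stubbed as a string) conditions on the
    # compressed rationale state rather than the raw token sequence.
    state = embed_rationale(rationale_tokens)
    return f"answer({question}; rationale_state={state:.3f})"

rationale = "first read the chart then compare the two bars".split()
print(answer_with_rationale("Which bar is taller?", rationale))
```

The point of the sketch is the cost profile: the rationale is consumed once, left to right, so doubling its length doubles the scan work rather than quadrupling it as full self-attention over the rationale would.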