Recent studies have demonstrated that Large Language Models (LLMs) can extend their zero-shot generalization capabilities to multimodal learning through instruction tuning. However, as more modalities and downstream tasks are introduced, conflicts and interference among them can increasingly degrade performance. This phenomenon has been largely overlooked in previous work. To address it, we propose a novel and extensible framework, called Octavius, for comprehensive studies and experimentation on multimodal learning with Multimodal Large Language Models (MLLMs). Specifically, we combine the well-known Mixture-of-Experts (MoE) with LoRA, a representative parameter-efficient fine-tuning (PEFT) technique, to design a novel LLM-based decoder, called LoRA-MoE, for multimodal learning. To the best of our knowledge, this is one of the pioneering efforts to introduce MoE into MLLMs to address this problem. Experimental results (about 20% improvement) demonstrate the effectiveness and versatility of our design on various 2D and 3D downstream tasks. Code and datasets are available at https://openlamm.github.io/paper_list/Octavius.
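To make the idea of combining MoE with LoRA concrete, the following is a minimal sketch of a linear layer whose frozen weight is augmented by a gated mixture of low-rank adapters. It is an illustration only, not the released Octavius implementation: the class name `LoRAMoELinear`, the top-k softmax router, the choice of gate input, and all hyperparameters (`n_experts`, `rank`, `top_k`, `alpha`) are assumptions made for this example, since the abstract does not specify them. It uses only the Python standard library.

```python
import math
import random


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


class LoRAMoELinear:
    """Frozen linear layer plus a gated mixture of LoRA experts (hypothetical sketch)."""

    def __init__(self, d_in, d_out, n_experts=4, rank=2, top_k=2, alpha=4.0, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

        self.W = rand(d_out, d_in)         # stands in for a frozen pretrained weight
        self.gate = rand(n_experts, d_in)  # router: one logit per expert
        # Each expert is a low-rank pair (A: rank x d_in, B: d_out x rank).
        # B starts at zero, as in standard LoRA, so the layer initially equals W.
        self.experts = [
            (rand(rank, d_in), [[0.0] * rank for _ in range(d_out)])
            for _ in range(n_experts)
        ]
        self.top_k = top_k
        self.scale = alpha / rank

    def route(self, x):
        """Return {expert_index: weight} for the top-k experts; weights sum to 1."""
        logits = matvec(self.gate, x)
        top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[: self.top_k]
        weights = softmax([logits[i] for i in top])
        return dict(zip(top, weights))

    def forward(self, x):
        """Compute W x plus the gate-weighted sum of selected low-rank updates."""
        y = matvec(self.W, x)
        for i, w in self.route(x).items():
            A, B = self.experts[i]
            delta = matvec(B, matvec(A, x))  # low-rank update B (A x)
            y = [yi + w * self.scale * di for yi, di in zip(y, delta)]
        return y
```

A call such as `LoRAMoELinear(d_in=8, d_out=4).forward([0.1 * i for i in range(8)])` returns a 4-element output. Because only the selected experts contribute, inputs routed to different experts update different adapter parameters during training, which is the intuition behind using MoE to reduce interference between tasks and modalities.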