Recent studies have demonstrated that Large Language Models (LLMs) can extend their zero-shot generalization capabilities to multimodal learning through instruction tuning. However, as more modalities and downstream tasks are introduced, conflicts and interference among them can increasingly degrade performance. While this phenomenon has been overlooked in previous work, we propose a novel and extensible framework, called Octavius, for comprehensive studies and experimentation on multimodal learning with Multimodal Large Language Models (MLLMs). Specifically, we combine the well-known Mixture-of-Experts (MoE) technique with LoRA, a representative parameter-efficient fine-tuning (PEFT) method, to design a novel LLM-based decoder, called LoRA-MoE, for multimodal learning. To the best of our knowledge, ours is one of the pioneering efforts to introduce MoE into MLLMs to address this problem. Experimental results (about a 20% improvement) demonstrate the effectiveness and versatility of our design across various 2D and 3D downstream tasks. Code and datasets are available at https://openlamm.github.io/tutorial/.
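The core idea of combining MoE with LoRA can be illustrated with a minimal sketch: a frozen base projection is augmented with several low-rank LoRA "experts" (e.g., one per modality or task), and a lightweight gate mixes their outputs per input. All names, dimensions, and the soft-routing choice below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, n_experts = 16, 16, 4, 3

# Frozen pretrained projection of the base LLM layer (kept fixed during PEFT).
W0 = rng.standard_normal((d_out, d_in)) * 0.02

# One LoRA expert per modality/task: delta_W_e = B_e @ A_e (low-rank update).
# Standard LoRA init: A ~ Gaussian, B = 0, so every expert starts as a no-op.
A = [rng.standard_normal((rank, d_in)) * 0.02 for _ in range(n_experts)]
B = [np.zeros((d_out, rank)) for _ in range(n_experts)]

# Lightweight gate that routes each input across the LoRA experts.
W_gate = rng.standard_normal((n_experts, d_in)) * 0.02

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def lora_moe_forward(x):
    """y = W0 x + sum_e g_e(x) * B_e (A_e x): a soft mixture of LoRA experts."""
    gates = softmax(W_gate @ x)
    y = W0 @ x
    for e in range(n_experts):
        y = y + gates[e] * (B[e] @ (A[e] @ x))
    return y, gates

x = rng.standard_normal(d_in)
y, gates = lora_moe_forward(x)
# With B_e = 0 at init, the layer reproduces the frozen base projection exactly.
assert np.allclose(y, W0 @ x)
```

Only the expert matrices and the gate are trainable, so the modality-specific capacity is added at LoRA cost while the gate isolates experts from one another, which is the mechanism the abstract invokes against cross-task interference.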