Multimodal Large Language Models (MLLMs) have gained significant attention due to their impressive capabilities in multimodal understanding. However, existing methods rely heavily on extensive modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities. In this paper, we propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities that enables MLLMs to continually EVolve on modalities for $\mathbb{X}$-modal reasoning. We leverage the concept of Continual Learning and develop an incremental training strategy atop pre-trained MLLMs, enabling their expansion to new modalities using uni-modal data, without executing joint-modal pretraining. In detail, a novel Adapter-in-Adapter (AnA) framework is introduced, in which uni-modal and cross-modal adapters are seamlessly integrated to facilitate efficient modality alignment and collaboration. Additionally, an MoE-based gating module is applied between the two types of adapters to further enhance multimodal interaction. To investigate the proposed method, we establish a challenging benchmark called Continual Learning of Modality (MCL), which consists of high-quality QA data from five distinct modalities: image, video, audio, depth, and point cloud. Extensive experiments demonstrate the effectiveness of the proposed AnA framework on learning plasticity and memory stability during continual learning. Furthermore, PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%. Our code is available at https://github.com/JiazuoYu/PathWeave
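To make the Adapter-in-Adapter idea concrete, the following is a minimal NumPy sketch of one plausible AnA block: a uni-modal and a cross-modal bottleneck adapter whose outputs are mixed by an MoE-style softmax gate. All names, shapes, and the specific gating form here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4  # hidden size and adapter bottleneck width (illustrative values)

def adapter(x, W_down, W_up):
    # Standard bottleneck adapter: down-project, ReLU, up-project, residual add.
    return x + np.maximum(x @ W_down, 0.0) @ W_up

# Hypothetical parameters: one uni-modal adapter (for the current modality)
# and one cross-modal adapter (shared across modalities), plus a gate.
W_uni_down, W_uni_up = 0.1 * rng.normal(size=(d, r)), 0.1 * rng.normal(size=(r, d))
W_cross_down, W_cross_up = 0.1 * rng.normal(size=(d, r)), 0.1 * rng.normal(size=(r, d))
W_gate = 0.1 * rng.normal(size=(d, 2))  # routes between the two adapter paths

def ana_block(x):
    h_uni = adapter(x, W_uni_down, W_uni_up)      # modality-specific path
    h_cross = adapter(x, W_cross_down, W_cross_up)  # cross-modal path
    logits = x @ W_gate
    g = np.exp(logits - logits.max(axis=-1, keepdims=True))
    g = g / g.sum(axis=-1, keepdims=True)         # softmax gate weights per token
    return g[..., :1] * h_uni + g[..., 1:] * h_cross

x = rng.normal(size=(3, d))  # three token embeddings
y = ana_block(x)
print(y.shape)  # same shape as the input: (3, 16)
```

Only the adapter and gate parameters would be trained in such a scheme, which is consistent with the abstract's claim of drastically reduced trainable-parameter counts relative to full joint-modal tuning.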