Recent developments in Multimodal Large Language Models (MLLMs) have shown rapid progress, moving towards the goal of creating versatile MLLMs that understand inputs from various modalities. However, existing methods typically rely on joint training with paired multimodal instruction data, which is resource-intensive and challenging to extend to new modalities. In this paper, we propose a new paradigm through the model composition of existing MLLMs to create a new model that retains the modal understanding capabilities of each original model. Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters. Furthermore, we introduce DAMC to address parameter interference and mismatch issues during the merging process, thereby enhancing model performance. To facilitate research in this area, we propose MCUB, a benchmark for assessing the ability of MLLMs to understand inputs from diverse modalities. Experiments on this benchmark and four other multimodal understanding tasks show significant improvements over baselines, proving that model composition can create a versatile model capable of processing inputs from multiple modalities.
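The abstract describes NaiveMC as reusing modality encoders while merging LLM parameters. As a minimal sketch of what such a parameter merge could look like, the hypothetical `naive_merge` function below averages parameters shared by two LLM state dicts (assuming identical architectures and parameter names); the actual NaiveMC implementation may differ.

```python
# Hypothetical sketch: average shared LLM parameters from two
# modality-specific MLLMs, while each modality encoder is reused as-is.
# Plain Python lists stand in for weight tensors for illustration.
def naive_merge(llm_a, llm_b):
    """Average parameters shared by two state dicts (assumed same keys/shapes)."""
    merged = {}
    for name, param_a in llm_a.items():
        param_b = llm_b[name]
        merged[name] = [(a + b) / 2 for a, b in zip(param_a, param_b)]
    return merged

# Toy example: two LLMs fine-tuned for different modalities
llm_vision = {"layer0.weight": [1.0, 3.0]}
llm_audio = {"layer0.weight": [3.0, 5.0]}
print(naive_merge(llm_vision, llm_audio))  # {'layer0.weight': [2.0, 4.0]}
```

In practice this naive averaging can suffer from the parameter interference and mismatch issues the abstract mentions, which motivates the more careful DAMC merging strategy.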