Recent developments in Multimodal Large Language Models (MLLMs) have shown rapid progress toward the goal of creating versatile MLLMs that understand inputs from various modalities. However, existing methods typically rely on joint training with paired multimodal instruction data, which is resource-intensive and difficult to extend to new modalities. In this paper, we propose a new paradigm: composing existing MLLMs to create a new model that retains the modal understanding capabilities of each original model. Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters. Furthermore, we introduce DAMC to address parameter interference and mismatch issues during the merging process, thereby enhancing model performance. To facilitate research in this area, we propose MCUB, a benchmark for assessing the ability of MLLMs to understand inputs from diverse modalities. Experiments on this benchmark and four other multimodal understanding tasks show significant improvements over baselines, demonstrating that model composition can create a versatile model capable of processing inputs from multiple modalities.