Model merging (e.g., via interpolation or task arithmetic) fuses multiple models trained on different tasks into a single multi-task solution. The technique has proven successful in previous studies, where the models are trained on similar tasks and share the same initialization. In this paper, we extend this concept to a multimodal setup by merging transformers trained on different modalities. Furthermore, we pursue a novel goal: merging the vision, language, and cross-modal transformers of a modality-specific architecture to create a parameter-efficient modality-agnostic architecture. Through comprehensive experiments, we systematically investigate the key factors that affect model performance after merging, including initialization, merging mechanisms, and model architectures. Our analysis leads to an effective training recipe for matching the performance of the modality-agnostic baseline (i.e., pre-trained from scratch) via model merging. Our code is available at: https://github.com/ylsung/vl-merging
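To make the two merging mechanisms named above concrete, the following is a minimal sketch of weight interpolation and task arithmetic over toy parameter dictionaries. It is an illustration under stated assumptions, not the code from the linked repository: the function names `interpolate` and `task_arithmetic`, the coefficients `alpha` and `lam`, and the toy weights are all hypothetical, and parameters are represented as plain lists of floats rather than framework tensors.

```python
from typing import Dict, List

# Hypothetical stand-in for a model's parameters: name -> flat list of weights.
Params = Dict[str, List[float]]


def interpolate(theta_a: Params, theta_b: Params, alpha: float = 0.5) -> Params:
    """Element-wise interpolation: alpha * theta_a + (1 - alpha) * theta_b."""
    if theta_a.keys() != theta_b.keys():
        raise ValueError("models must share the same parameter names")
    return {
        name: [alpha * a + (1.0 - alpha) * b
               for a, b in zip(theta_a[name], theta_b[name])]
        for name in theta_a
    }


def task_arithmetic(theta_init: Params, finetuned: List[Params],
                    lam: float = 1.0) -> Params:
    """Add scaled task vectors: theta_init + lam * sum_i (theta_i - theta_init).

    Assumes every fine-tuned model started from the same theta_init.
    """
    merged: Params = {}
    for name, init in theta_init.items():
        merged[name] = [
            w0 + lam * sum(model[name][j] - w0 for model in finetuned)
            for j, w0 in enumerate(init)
        ]
    return merged


# Toy example: one shared initialization and two models trained from it.
init = {"w": [0.0, 1.0]}
vision = {"w": [1.0, 1.0]}
language = {"w": [0.0, 3.0]}

merged_interp = interpolate(vision, language, alpha=0.5)
merged_arith = task_arithmetic(init, [vision, language], lam=1.0)
print(merged_interp)  # {'w': [0.5, 2.0]}
print(merged_arith)   # {'w': [1.0, 3.0]}
```

Both functions operate parameter by parameter, which is why they presuppose models with identical parameter names and shapes; task arithmetic additionally requires the shared starting point `init`.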