Model merging (e.g., via interpolation or task arithmetic) fuses multiple models trained on different tasks into a single multi-task solution. Previous studies have shown the technique to be successful when the models are trained on similar tasks and from the same initialization. In this paper, we extend this concept to a multimodal setup by merging transformers trained on different modalities. Furthermore, we pursue a novel goal: merging the vision, language, and cross-modal transformers of a modality-specific architecture to create a parameter-efficient modality-agnostic architecture. Through comprehensive experiments, we systematically investigate the key factors that affect model performance after merging, including initialization, merging mechanisms, and model architectures. We also propose two metrics that assess the distance between the weights to be merged and can serve as indicators of the merging outcome. Our analysis leads to an effective training recipe for matching the performance of the modality-agnostic baseline (i.e., pre-trained from scratch) via model merging. Our method also significantly outperforms naive merging on various tasks, with improvements of 3% on VQA, 7% on COCO retrieval, 25% on NLVR2, 14% on Flickr30k, and 3% on ADE20k. Our code is available at https://github.com/ylsung/vl-merging.
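To make the two merging mechanisms named above concrete, the following is a minimal sketch of weight interpolation and task arithmetic in their commonly used forms. It is not taken from the released code: the function names, the `Weights` representation (a dict of flat float lists), and the toy values are all hypothetical, and the L2 distance is only a generic stand-in, not one of the two metrics proposed in the paper.

```python
from math import sqrt
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def interpolate(w_a: Weights, w_b: Weights, alpha: float = 0.5) -> Weights:
    """Linear interpolation: alpha * w_a + (1 - alpha) * w_b, per parameter."""
    return {
        name: [alpha * a + (1 - alpha) * b for a, b in zip(w_a[name], w_b[name])]
        for name in w_a
    }


def task_arithmetic(w_init: Weights, finetuned: List[Weights], lam: float = 1.0) -> Weights:
    """Add the scaled sum of task vectors (w_i - w_init) to the shared initialization."""
    merged = {}
    for name, init in w_init.items():
        merged[name] = [
            p0 + lam * sum(w[name][i] - p0 for w in finetuned)
            for i, p0 in enumerate(init)
        ]
    return merged


def l2_distance(w_a: Weights, w_b: Weights) -> float:
    """Generic Euclidean distance over all parameters (an assumed stand-in metric)."""
    return sqrt(sum((a - b) ** 2 for name in w_a for a, b in zip(w_a[name], w_b[name])))


# Toy example: two models trained from the same (hypothetical) initialization.
init = {"w": [0.0, 0.0]}
vision = {"w": [1.0, 2.0]}
language = {"w": [3.0, -2.0]}

print(interpolate(vision, language, alpha=0.5))
print(task_arithmetic(init, [vision, language], lam=0.5))
print(l2_distance(vision, language))
```

Both functions assume the models share parameter names and shapes. Task arithmetic additionally assumes a common initialization, which is the setting the abstract describes for prior work.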