Fusion is a technique for merging multiple independently-trained neural networks to combine their capabilities. Past attempts have been restricted to fully-connected, convolutional, and residual networks. In this paper, we present a systematic approach for fusing two or more transformer-based networks that exploits Optimal Transport to (soft-)align their architectural components. We flesh out an abstraction for layer alignment that can, in principle, generalize to arbitrary architectures. We apply it to the key ingredients of Transformers, such as multi-head self-attention, layer normalization, and residual connections, and examine how to handle each of them through ablation studies. Furthermore, our method allows the fusion of models of different sizes (heterogeneous fusion), providing a new and efficient way to compress Transformers. The proposed approach is evaluated on image classification tasks with the Vision Transformer and on natural language modeling tasks with BERT. Our approach consistently outperforms vanilla fusion and, after surprisingly short finetuning, also outperforms the individual converged parent models. In our analysis, we uncover intriguing insights about the significant role of soft alignment in the case of Transformers. Our results showcase the potential of fusing multiple Transformers, thus compounding their expertise, in the budding paradigm of model fusion and recombination.
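To make the idea of soft alignment concrete, the following is a minimal sketch of fusing a single linear layer from two models with an entropy-regularized Optimal Transport plan. It is an illustration under stated assumptions, not the implementation used in this work: the function names `sinkhorn_plan` and `fuse_layer` are hypothetical, the cost is assumed to be the squared Euclidean distance between neurons' incoming weights, both layers are assumed to share the same input dimension, and the aligned weights are simply averaged. Propagating the alignment to the next layer, and the handling of multi-head self-attention, layer normalization, and residual connections, are all omitted.

```python
import numpy as np


def sinkhorn_plan(cost, reg=0.05, n_iters=200):
    """Entropy-regularized OT plan between two uniform distributions (Sinkhorn)."""
    n_a, n_b = cost.shape
    a = np.full(n_a, 1.0 / n_a)  # uniform mass over model A's neurons
    b = np.full(n_b, 1.0 / n_b)  # uniform mass over model B's neurons
    scale = cost.max()
    if scale > 0:
        cost = cost / scale  # rescale to [0, 1] so exp() does not underflow
    kernel = np.exp(-cost / reg)
    u = np.ones(n_a)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    return u[:, None] * kernel * v[None, :]


def fuse_layer(w_a, w_b, reg=0.05):
    """Softly align the rows (neurons) of w_b to those of w_a, then average.

    w_a has shape (n_a, d) and w_b has shape (n_b, d); n_a and n_b may differ.
    Returns the fused weights, shaped like w_a, and the transport plan.
    """
    # Assumed cost: squared Euclidean distance between incoming weight vectors.
    diff = w_a[:, None, :] - w_b[None, :, :]
    cost = (diff ** 2).sum(axis=-1)
    plan = sinkhorn_plan(cost, reg=reg)
    # Each row of the plan sums to 1 / n_a; rescaling turns it into a convex
    # combination of B's neurons (a barycentric projection onto A's neurons).
    w_b_aligned = (w_a.shape[0] * plan) @ w_b
    # Assumption: equal-weight averaging of the two aligned layers.
    return 0.5 * (w_a + w_b_aligned), plan


# Demo: model B is model A with its neurons cyclically shifted.
rng = np.random.default_rng(0)
w_a = rng.normal(size=(8, 16))
w_b = np.roll(w_a, 1, axis=0)
fused, plan = fuse_layer(w_a, w_b)
print(fused.shape, plan.shape)
```

A smaller `reg` pushes the plan toward a hard permutation, while a larger value spreads each neuron's mass over several counterparts. Because the plan may be rectangular, the same sketch also accepts layers of different widths, in which case the output takes the width of `w_a`.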