Merging Transformer-based models that were each trained on a different task yields a single unified model that can execute all the tasks concurrently. Previous methods, exemplified by task arithmetic, have proven both effective and scalable. However, existing methods have primarily focused on seeking a static optimal solution within the original model parameter space. A notable challenge is mitigating the interference between parameters of different models, which can substantially degrade performance. In this paper, we propose to merge most of the parameters while upscaling the MLP of the Transformer layers to a weight-ensembling mixture of experts (MoE) module. This module dynamically integrates shared and task-specific knowledge based on the input, providing a more flexible solution that adapts to the specific needs of each instance. Our key insight is that by identifying and separating shared knowledge and task-specific knowledge, and then dynamically integrating them, we can mitigate the parameter interference problem to a great extent. We conduct the conventional multi-task model merging experiments and evaluate the generalization and robustness of our method. The results demonstrate the effectiveness of our method and provide a comprehensive understanding of it. The code is available at https://anonymous.4open.science/r/weight-ensembling_MoE-67C9/
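To make the contrast between a static merge and an input-dependent merge concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that task-specific knowledge is represented as task vectors (fine-tuned weights minus pretrained weights, as in task arithmetic) and that the router is a single linear layer over mean-pooled tokens. The function names, the router form, the toy two-layer MLP, and all shapes are hypothetical choices made for illustration.

```python
import numpy as np


def task_vectors(pretrained, finetuned_list):
    """Task vector = fine-tuned weights minus pretrained weights, per parameter."""
    return [{k: ft[k] - pretrained[k] for k in pretrained} for ft in finetuned_list]


def task_arithmetic(pretrained, vectors, scale=0.3):
    """Static merge: add the scaled sum of all task vectors to the pretrained weights."""
    return {k: pretrained[k] + scale * sum(v[k] for v in vectors) for k in pretrained}


def mlp_forward(params, x):
    """Two-layer ReLU MLP, a stand-in for a Transformer feed-forward block."""
    h = np.maximum(x @ params["w1"] + params["b1"], 0.0)
    return h @ params["w2"] + params["b2"]


def weight_ensembling_mlp(pretrained, vectors, router_w, router_b, x):
    """Dynamic merge: the router maps the input to one coefficient per task
    vector, the MLP weights are assembled from them, then the MLP is applied."""
    # Assumed router: mean-pool the tokens, then apply one linear layer.
    coeffs = x.mean(axis=0) @ router_w + router_b  # shape: (num_tasks,)
    merged = {
        k: pretrained[k] + sum(c * v[k] for c, v in zip(coeffs, vectors))
        for k in pretrained
    }
    return mlp_forward(merged, x), coeffs


# Toy setup: random weights stand in for one pretrained and two fine-tuned MLPs.
rng = np.random.default_rng(0)
d_model, d_hidden, num_tasks, num_tokens = 4, 8, 2, 5
shapes = {"w1": (d_model, d_hidden), "b1": (d_hidden,),
          "w2": (d_hidden, d_model), "b2": (d_model,)}
pretrained = {k: rng.normal(size=s) for k, s in shapes.items()}
finetuned = [{k: pretrained[k] + 0.1 * rng.normal(size=s) for k, s in shapes.items()}
             for _ in range(num_tasks)]
x = rng.normal(size=(num_tokens, d_model))

vectors = task_vectors(pretrained, finetuned)
router_w = 0.1 * rng.normal(size=(d_model, num_tasks))
router_b = np.full(num_tasks, 0.3)
output, coeffs = weight_ensembling_mlp(pretrained, vectors, router_w, router_b, x)
```

In this sketch, a router that ignores its input (zero `router_w`, constant `router_b`) reduces the dynamic merge to task arithmetic with that constant as the scaling coefficient, which is one way to see the static solution as a special case of the input-dependent one.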