Model merging has recently emerged as a cost-efficient paradigm for multi-task learning. Among current approaches, task arithmetic stands out for its simplicity and effectiveness. In this paper, we motivate the effectiveness of task vectors by linking them to multi-task gradients. We show that in a single-epoch scenario, if optimization is performed via gradient descent, task vectors after one step are mathematically equivalent to the gradients obtained via gradient descent in a multi-task setting, and they still approximate these gradients in subsequent epochs. Furthermore, we show that the effectiveness of task vectors is largely driven by the first epoch's gradient. Given this parallel between task vectors and gradients, we propose viewing model merging as a single step in an iterative process that Alternates between Tuning and Merging (ATM). We then propose two ways to use ATM. The first is to replace multi-task learning with ATM in scenarios where data sharing is prohibited, such as federated learning. The second is to improve the outcome of any model merging algorithm by applying a few post-hoc ATM iterations on a small validation set, which is commonly available for hyperparameter tuning. Finally, we provide both empirical and theoretical support for the effectiveness of ATM, showing that it minimizes an upper bound on the loss obtained by jointly finetuning on all tasks.
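To make the stated equivalence concrete, the following one-step derivation is a minimal sketch under assumed notation not fixed by the abstract: let $\theta_0$ denote the pretrained weights, $\mathcal{L}_t$ the loss of task $t$, and $\eta$ the learning rate. One full-batch gradient step of finetuning on task $t$ yields $\theta_t = \theta_0 - \eta \nabla \mathcal{L}_t(\theta_0)$, so
\[
\tau_t \;=\; \theta_t - \theta_0 \;=\; -\eta \,\nabla \mathcal{L}_t(\theta_0),
\qquad
\theta_0 + \sum_{t=1}^{T} \tau_t \;=\; \theta_0 - \eta \,\nabla \Big( \sum_{t=1}^{T} \mathcal{L}_t \Big)(\theta_0),
\]
i.e., adding the task vectors to the pretrained weights is exactly one gradient-descent step on the summed multi-task loss. After the first step the per-task finetuning trajectories diverge from the shared initialization, which is why the identity relaxes to an approximation in subsequent epochs.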
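The alternating tune-and-merge loop itself can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the quadratic per-task objectives, the finetune helper, and all hyperparameters are assumptions made for this sketch.

    import numpy as np

    # Toy setup: task t pulls the parameters toward a target c_t via the
    # quadratic loss L_t(w) = 0.5 * ||w - c_t||^2, whose gradient is w - c_t.
    # These losses are illustrative stand-ins for real finetuning objectives.
    rng = np.random.default_rng(0)
    targets = [rng.normal(size=4) for _ in range(3)]  # one target per task

    def finetune(w, c, lr=0.1, steps=5):
        """Run a few gradient-descent steps on L(w) = 0.5 * ||w - c||^2."""
        w = w.copy()
        for _ in range(steps):
            w -= lr * (w - c)
        return w

    w = np.zeros(4)  # stands in for the pretrained weights theta_0
    for _ in range(10):  # ATM: alternate tuning and merging
        # Tune: finetune the current merged model on each task independently,
        # then read off the task vectors (finetuned minus current weights).
        task_vectors = [finetune(w, c) - w for c in targets]
        # Merge: apply the averaged task vector, which plays the role of one
        # multi-task gradient step in the view developed above.
        w = w + np.mean(task_vectors, axis=0)

    print(w)                          # approaches the mean of the targets,
    print(np.mean(targets, axis=0))   # the joint minimizer of the summed loss

On this toy problem each merge moves the shared weights toward the minimizer of the summed loss, mirroring the claim that iterated merging tracks multi-task gradient descent; no task ever sees another task's data, which is what makes the same loop applicable when data sharing is prohibited.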