Model merging has recently emerged as a cost-efficient paradigm for multi-task learning. Among current approaches, task arithmetic stands out for its simplicity and effectiveness. In this paper, we motivate the effectiveness of task vectors by linking them to multi-task gradients. We show that in a single-epoch scenario, if optimization is performed via gradient descent, task vectors after one step are mathematically equivalent to the gradients obtained via gradient descent in a multi-task setting, and they still approximate these gradients in subsequent epochs. Furthermore, we show that the effectiveness of task vectors is largely driven by the first epoch's gradient. Given this parallel between task vectors and gradients, we propose viewing model merging as a single step in an iterative process that Alternates between Tuning and Merging (ATM). We then propose two ways to use ATM. The first is to replace multi-task learning with ATM in scenarios where data sharing is prohibited, such as federated learning. The second is to improve the outcome of any model merging algorithm by applying a few post-hoc ATM iterations on a small validation set, which is commonly available for hyperparameter tuning. Finally, we provide both empirical and theoretical support for the effectiveness of ATM, showing that it minimizes an upper bound on the loss obtained by jointly finetuning on all tasks.
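To make the stated equivalence concrete, the following one-step derivation is a minimal sketch under assumed notation not fixed by the abstract: let $\theta_0$ denote the pretrained weights, $\mathcal{L}_t$ the loss of task $t$, and $\eta$ the learning rate. One full-batch gradient step of finetuning on task $t$ yields $\theta_t = \theta_0 - \eta \nabla \mathcal{L}_t(\theta_0)$, so
\[
\tau_t \;=\; \theta_t - \theta_0 \;=\; -\eta \,\nabla \mathcal{L}_t(\theta_0),
\qquad
\theta_0 + \sum_{t=1}^{T} \tau_t \;=\; \theta_0 - \eta \,\nabla \Big( \sum_{t=1}^{T} \mathcal{L}_t \Big)(\theta_0),
\]
i.e., adding the task vectors to the pretrained weights is exactly one gradient-descent step on the summed multi-task loss. After the first step the per-task finetuning trajectories diverge from the shared initialization, which is why the identity relaxes to an approximation in subsequent epochs.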
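The alternating tune-and-merge loop itself can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: the quadratic per-task objectives, the finetune helper, and all hyperparameters are assumptions made for this sketch.

    import numpy as np

    # Toy setup: task t pulls the parameters toward a target c_t via the
    # quadratic loss L_t(w) = 0.5 * ||w - c_t||^2, whose gradient is w - c_t.
    # These losses are illustrative stand-ins for real finetuning objectives.
    rng = np.random.default_rng(0)
    targets = [rng.normal(size=4) for _ in range(3)]  # one target per task

    def finetune(w, c, lr=0.1, steps=5):
        """Run a few gradient-descent steps on L(w) = 0.5 * ||w - c||^2."""
        w = w.copy()
        for _ in range(steps):
            w -= lr * (w - c)
        return w

    w = np.zeros(4)  # stands in for the pretrained weights theta_0
    for _ in range(10):  # ATM: alternate tuning and merging
        # Tune: finetune the current merged model on each task independently,
        # then read off the task vectors (finetuned minus current weights).
        task_vectors = [finetune(w, c) - w for c in targets]
        # Merge: apply the averaged task vector, which plays the role of one
        # multi-task gradient step in the view developed above.
        w = w + np.mean(task_vectors, axis=0)

    print(w)                          # approaches the mean of the targets,
    print(np.mean(targets, axis=0))   # the joint minimizer of the summed loss

On this toy problem each merge moves the shared weights toward the minimizer of the summed loss, mirroring the claim that iterated merging tracks multi-task gradient descent; no task ever sees another task's data, which is what makes the same loop applicable when data sharing is prohibited.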