Task arithmetic has emerged as a simple yet powerful technique for model merging, enabling the combination of multiple finetuned models into one. Despite its empirical success, a clear theoretical explanation of why and when it works is lacking. This paper provides a rigorous theoretical foundation for task arithmetic by establishing a connection between task vectors and gradients of the task losses. We show that under standard gradient descent, a task vector generated from one epoch of finetuning is exactly equivalent to the negative gradient of the loss, scaled by the learning rate. For the practical multi-epoch setting, we prove that this equivalence holds approximately, with a second-order error term that we explicitly bound for feed-forward networks. Our empirical analysis across seven vision benchmarks corroborates our theory, demonstrating that the first-epoch gradient dominates the finetuning trajectory in both norm and direction. A key implication is that merging models finetuned for only a single epoch often yields performance comparable to merging fully converged models. These findings reframe task arithmetic as a form of approximate multitask learning, providing a clear rationale for its effectiveness and highlighting the critical role of early training dynamics in model merging.
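As an illustrative sketch rather than the paper's formal statement: writing $\theta_0$ for the pretrained weights, $\mathcal{L}_t$ for task $t$'s loss, $\eta$ for the learning rate, and $\lambda$ for the merging coefficient (all notation assumed here, not fixed by the abstract), the one-epoch equivalence under full-batch gradient descent and its multitask reading amount to
\[
\tau_t \;=\; \theta_t - \theta_0 \;=\; -\,\eta\,\nabla \mathcal{L}_t(\theta_0),
\qquad
\theta_{\mathrm{merged}} \;=\; \theta_0 + \lambda \sum_t \tau_t \;=\; \theta_0 - \lambda\eta \sum_t \nabla \mathcal{L}_t(\theta_0),
\]
i.e., the merged model corresponds to one scaled step of joint (multitask) gradient descent taken from the pretrained weights.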