Models trained on different datasets can be merged by weighted averaging of their parameters, but why does this work, and when can it fail? Here, we connect the inaccuracy of weighted averaging to mismatches in the gradients and propose a new uncertainty-based scheme that improves performance by reducing the mismatch. The connection also reveals the implicit assumptions in other schemes such as averaging, task arithmetic, and Fisher-weighted averaging. Our new method gives consistent improvements for large language models and vision transformers, both in performance and in robustness to hyperparameters.
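As a rough illustration of the existing merging schemes named above, the following minimal sketch applies plain weighted averaging, task arithmetic, and diagonal Fisher-weighted averaging to flat parameter vectors. It is not the uncertainty-based scheme proposed here. The function names, the use of plain Python lists for parameters, and the `eps` stabilizer are assumptions made for illustration only.

```python
from typing import List, Sequence

Vector = List[float]  # hypothetical stand-in for a flattened parameter tensor


def weighted_average(params: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Merge models by a normalized weighted average of their parameters."""
    total = sum(weights)
    dim = len(params[0])
    return [
        sum(w * p[i] for w, p in zip(weights, params)) / total
        for i in range(dim)
    ]


def task_arithmetic(base: Vector, finetuned: Sequence[Vector],
                    alphas: Sequence[float]) -> Vector:
    """Add scaled task vectors (fine-tuned minus base) to the base model."""
    return [
        base[i] + sum(a * (p[i] - base[i]) for a, p in zip(alphas, finetuned))
        for i in range(len(base))
    ]


def fisher_weighted_average(params: Sequence[Vector], fishers: Sequence[Vector],
                            eps: float = 1e-8) -> Vector:
    """Average each coordinate, weighted by per-model diagonal Fisher estimates.

    `eps` is an assumed stabilizer that avoids division by zero.
    """
    dim = len(params[0])
    return [
        sum(f[i] * p[i] for f, p in zip(fishers, params))
        / (sum(f[i] for f in fishers) + eps)
        for i in range(dim)
    ]


if __name__ == "__main__":
    base = [0.0, 0.0, 0.0]
    theta_a = [1.0, 2.0, 3.0]
    theta_b = [3.0, 2.0, 1.0]
    print(weighted_average([theta_a, theta_b], [1.0, 1.0]))
    print(task_arithmetic(base, [theta_a, theta_b], [0.5, 0.5]))
    # Each model is trusted only on the coordinates where its Fisher value is large.
    print(fisher_weighted_average([theta_a, theta_b],
                                  [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]))
```

In this toy setting, with equal weights and a zero base model, the first two schemes coincide. Fisher weighting instead lets each coordinate follow whichever model reports higher curvature there.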