Recently, Ainsworth et al. showed that weight matching (WM), a permutation search over model parameters that minimizes the $L_2$ distance between two models, effectively identifies permutations that satisfy linear mode connectivity (LMC), in which the loss along the linear path between two models trained independently with different seeds remains nearly constant. This paper provides a theoretical analysis of LMC under WM, which is crucial for understanding the effectiveness of stochastic gradient descent and for applications such as model merging. We first show, both experimentally and theoretically, that permutations found by WM do not significantly reduce the $L_2$ distance between two models, so the occurrence of LMC is not merely due to the distance reduction achieved by WM itself. We then provide theoretical insights showing that permutations can change the directions of the singular vectors, but not the singular values, of the weight matrices in each layer. This finding indicates that permutations found by WM mainly align, across models, the directions of the singular vectors associated with large singular values. Because these singular vectors determine the model's functionality, the alignment brings them closer between the pre-merged and post-merged models, so the post-merged model retains functionality similar to that of the pre-merged models, which makes LMC easy to satisfy. Finally, we analyze the difference between WM and the straight-through estimator (STE), a dataset-dependent permutation search method, and show that WM outperforms STE, especially when merging three or more models.
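The claim that a permutation changes the directions of the singular vectors but not the singular values can be checked numerically. The following is a minimal sketch, not the paper's code: the two-layer ReLU network, its dimensions, the helper `mlp`, and the fixed cyclic permutation are all hypothetical choices made for illustration. It permutes the hidden units of one layer, confirms that the network output is unchanged, and compares the SVD of the first-layer weights before and after.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer network: x -> W2 @ relu(W1 @ x)
d_in, d_hidden, d_out = 5, 8, 3
W1 = rng.normal(size=(d_hidden, d_in))
W2 = rng.normal(size=(d_out, d_hidden))

def mlp(W1, W2, x):
    """Forward pass of the illustrative two-layer ReLU network."""
    return W2 @ np.maximum(W1 @ x, 0.0)

# A fixed, non-identity permutation of the hidden units (cyclic shift).
perm = np.roll(np.arange(d_hidden), 1)
P = np.eye(d_hidden)[perm]          # permutation matrix, P @ W1 == W1[perm]

W1_p = P @ W1                       # reorder the rows of the first layer
W2_p = W2 @ P.T                     # reorder the columns of the second layer

# 1) The permuted network computes the same function.
x = rng.normal(size=d_in)
function_preserved = np.allclose(mlp(W1, W2, x), mlp(W1_p, W2_p, x))

# 2) Singular values are identical; left singular vectors are permuted.
U, S, Vt = np.linalg.svd(W1, full_matrices=False)
U_p, S_p, Vt_p = np.linalg.svd(W1_p, full_matrices=False)
singular_values_equal = np.allclose(S, S_p)

# Singular vectors are defined up to sign, so compare absolute overlaps:
# each column of U_p should match the corresponding column of P @ U.
overlap = np.abs(U_p.T @ (P @ U))
left_vectors_permuted = np.allclose(overlap, np.eye(d_in), atol=1e-6)

# 3) The weights themselves have moved in parameter space.
l2_shift = np.linalg.norm(W1 - W1_p)

print("function preserved:     ", function_preserved)
print("singular values equal:  ", singular_values_equal)
print("left vectors permuted:  ", left_vectors_permuted)
print("L2 distance W1 vs P@W1: ", l2_shift)
```

Since $PW_1 = (PU)\Sigma V^\top$ is itself a valid SVD, the permutation acts only on the left singular vectors of this layer. This is the degree of freedom that, according to the abstract, WM uses to align the dominant singular directions of two models.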