Recent studies suggest that, for sufficiently wide models, most SGD solutions converge to the same basin up to permutation. This phenomenon, known as the model re-basin regime, has significant implications for model averaging because it ensures linear mode connectivity. However, current re-basin strategies are ineffective in many scenarios because the underlying mechanisms are not well understood. To address this gap, this paper provides new insights into understanding and improving the standard practice. First, we decompose re-normalization into rescaling and reshift, and find that rescaling plays a crucial role in re-normalization while re-basin performance is sensitive to shifts in model activation. This finding calls for more nuanced handling of the activation shift. Second, we identify that the merged model suffers from activation collapse and magnitude collapse. Varying the learning rate, weight decay, and initialization method can mitigate these issues and improve model performance. Lastly, we propose a new perspective that unifies re-basin and pruning, from which we derive a lightweight yet effective post-pruning technique that can significantly improve model performance after pruning. Our implementation is available at https://github.com/XingyuQu/rethink-re-basin.
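The phrase "up to permutation" refers to a symmetry of neural networks: reordering the hidden units of a layer, together with the matching rows of the next layer, changes the weights but not the function the network computes. The following is a minimal sketch of that symmetry only, assuming a toy two-layer ReLU network in NumPy. The names `mlp` and `permute_hidden` are hypothetical and are not taken from the authors' repository, and the sketch does not implement any re-basin, re-normalization, or pruning method from the paper.

```python
import numpy as np


def mlp(x, w1, b1, w2, b2):
    """Toy two-layer ReLU network: relu(x @ w1 + b1) @ w2 + b2."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2


def permute_hidden(w1, b1, w2, perm):
    """Reorder hidden units (hypothetical helper, for illustration only).

    Columns of w1 and entries of b1 are permuted, and the rows of w2 are
    permuted the same way, so each hidden unit keeps its own weights.
    """
    return w1[:, perm], b1[perm], w2[perm, :]


rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3

# Random weights stand in for one trained SGD solution.
w1 = rng.normal(size=(d_in, d_hidden))
b1 = rng.normal(size=d_hidden)
w2 = rng.normal(size=(d_hidden, d_out))
b2 = rng.normal(size=d_out)
x = rng.normal(size=(5, d_in))

# Apply a random permutation to the hidden layer.
perm = rng.permutation(d_hidden)
w1_p, b1_p, w2_p = permute_hidden(w1, b1, w2, perm)

out_original = mlp(x, w1, b1, w2, b2)
out_permuted = mlp(x, w1_p, b1_p, w2_p, b2)

# The two outputs agree up to floating-point rounding.
max_diff = float(np.max(np.abs(out_original - out_permuted)))
print("outputs match:", np.allclose(out_original, out_permuted))
```

Because every such permutation yields an equivalent network, two independently trained models can be functionally similar while their weights are far apart coordinate-wise. Re-basin methods search for a permutation that aligns one model with the other before averaging.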