Rethinking Model Re-Basin and Linear Mode Connectivity

Recent studies suggest that with sufficiently wide models, most SGD solutions can, up to permutation, converge into the same basin. This phenomenon, known as the model re-basin regime, has significant implications for model averaging because it ensures linear mode connectivity. However, current re-basin strategies are ineffective in many scenarios because the underlying mechanisms are not well understood. To address this gap, this paper provides novel insights into understanding and improving the standard practice. First, we decompose re-normalization into rescaling and reshift, and find that rescaling plays a crucial role in re-normalization while re-basin performance is sensitive to shifts in model activation. This finding calls for more nuanced handling of the activation shift. Second, we identify that the merged model suffers from activation collapse and magnitude collapse. Varying the learning rate, weight decay, and initialization method can mitigate these issues and improve model performance. Lastly, we propose a new perspective that unifies re-basin and pruning, from which we derive a lightweight yet effective post-pruning technique that can significantly improve model performance after pruning. Our implementation is available at https://github.com/XingyuQu/rethink-re-basin.
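To make "up to permutation" concrete, the following is a minimal NumPy sketch, not code from the linked repository. It assumes a toy two-layer ReLU network, and the helper names `mlp`, `permute_hidden`, and `interpolate` are hypothetical. It shows that reordering hidden units leaves the network's function unchanged, and it builds the linear interpolation between two weight sets that linear mode connectivity is concerned with.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """Toy two-layer ReLU network: relu(x @ w1 + b1) @ w2 + b2."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def permute_hidden(w1, b1, w2, perm):
    """Reorder hidden units: columns of w1, entries of b1, rows of w2."""
    return w1[:, perm], b1[perm], w2[perm, :]

def interpolate(params_a, params_b, alpha):
    """Tensor-wise linear interpolation (1 - alpha) * A + alpha * B."""
    return tuple((1.0 - alpha) * a + alpha * b
                 for a, b in zip(params_a, params_b))

# Illustrative sizes and random weights (assumptions, not from the paper).
rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3
w1 = rng.normal(size=(d_in, d_hidden))
b1 = rng.normal(size=d_hidden)
w2 = rng.normal(size=(d_hidden, d_out))
b2 = rng.normal(size=d_out)
x = rng.normal(size=(5, d_in))

# A permuted copy has different weight tensors but the same outputs.
perm = rng.permutation(d_hidden)
w1_p, b1_p, w2_p = permute_hidden(w1, b1, w2, perm)
same_function = np.allclose(mlp(x, w1, b1, w2, b2),
                            mlp(x, w1_p, b1_p, w2_p, b2))
print("permuted copy computes the same function:", same_function)

# Midpoint in weight space between the model and its unaligned copy.
params_a = (w1, b1, w2, b2)
params_b = (w1_p, b1_p, w2_p, b2)
midpoint = interpolate(params_a, params_b, 0.5)
gap = np.abs(mlp(x, *midpoint) - mlp(x, *params_a)).max()
print("max output gap at the unaligned midpoint:", gap)
```

Although the two endpoints compute the same function, their unaligned midpoint generally does not. That gap is what re-basin methods aim to close by searching for a permutation that aligns one model to the other before averaging.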

