Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch

In this paper, we show that Language Models (LMs), whether encoder- or decoder-based, can obtain new capabilities by assimilating the parameters of homologous models without retraining or GPUs. Typically, new abilities are imparted to LMs by Supervised Fine-Tuning (SFT), and are reflected in the disparity between fine-tuned and pre-trained parameters (i.e., delta parameters). We first observe that, with a novel operation called DARE (Drop And REscale), most delta parameters can be set directly to zero without affecting the capabilities of SFT LMs, and that larger models can tolerate a higher proportion of discarded parameters. Based on this observation, we further sparsify the delta parameters of multiple homologous SFT models with DARE and subsequently merge them into a single model by parameter averaging. We conduct experiments on eight datasets from the GLUE benchmark with BERT and RoBERTa. We also merge WizardLM, WizardMath, and Code Alpaca, which are based on Llama 2. Experimental results show that: (1) The delta parameter value ranges for SFT models are typically small, often within 0.005, and DARE can eliminate 99% of them effortlessly. However, once the models are continuously pre-trained, the value ranges can grow to around 0.03, making DARE impractical. We have also tried removing fine-tuned parameters instead of delta parameters and find that a 10% reduction can lead to drastically decreased performance (even to 0). This highlights that SFT merely stimulates abilities via delta parameters rather than injecting new abilities into LMs; (2) DARE can merge multiple task-specific LMs into one LM with diverse abilities. For instance, merging WizardLM and WizardMath improves the GSM8K zero-shot accuracy of WizardLM from 2.2 to 66.3, retaining its instruction-following ability while surpassing WizardMath's original score of 64.2. Code is available at https://github.com/yule-BUAA/MergeLM.
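The drop, rescale, and average steps described in the abstract can be sketched roughly as follows. This is a minimal illustration on flat Python lists, not the code from the MergeLM repository: the function names `dare` and `merge_by_averaging`, the per-parameter random drop, and the rescaling of surviving deltas by 1 / (1 - drop_rate) are assumptions made for the example, and "parameter averaging" is read here as adding the mean of the sparsified deltas back onto the shared pre-trained parameters.

```python
import random


def dare(pretrained, finetuned, drop_rate, rng):
    """Sparsify delta parameters: randomly drop, then rescale survivors.

    Hypothetical helper for illustration; operates on flat lists of floats.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    # Assumed rescale factor: keeps the expected value of each delta unchanged.
    keep_scale = 1.0 / (1.0 - drop_rate)
    sparse_delta = []
    for base, tuned in zip(pretrained, finetuned):
        delta = tuned - base  # delta parameter = fine-tuned - pre-trained
        if rng.random() < drop_rate:
            sparse_delta.append(0.0)  # drop
        else:
            sparse_delta.append(delta * keep_scale)  # rescale
    return sparse_delta


def merge_by_averaging(pretrained, deltas):
    """Add the element-wise mean of several delta lists to the base parameters."""
    n = len(deltas)
    return [
        base + sum(d[i] for d in deltas) / n
        for i, base in enumerate(pretrained)
    ]


# Toy usage: two "fine-tuned" models sharing one pre-trained backbone.
rng = random.Random(0)
base = [0.10, -0.20, 0.30, 0.05]
model_a = [0.103, -0.198, 0.301, 0.049]
model_b = [0.099, -0.204, 0.302, 0.052]
sparse_a = dare(base, model_a, drop_rate=0.5, rng=rng)
sparse_b = dare(base, model_b, drop_rate=0.5, rng=rng)
merged = merge_by_averaging(base, [sparse_a, sparse_b])
```

Because both models are expressed as deltas against the same pre-trained parameters, the sketch only makes sense for homologous models, which matches the setting the abstract describes.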

