In this paper, we show that Language Models (LMs), whether encoder- or decoder-based, can acquire new capabilities by assimilating the parameters of homologous models, without retraining or GPUs. New abilities are typically imparted to LMs by Supervised Fine-Tuning (SFT), and they are reflected in the difference between fine-tuned and pre-trained parameters (i.e., delta parameters). We first observe that, with a novel operation called DARE (Drop And REscale), most delta parameters can be set directly to zero without affecting the capabilities of SFT LMs, and that larger models tolerate a higher proportion of discarded parameters. Based on this observation, we sparsify the delta parameters of multiple homologous SFT models with DARE and then merge them into a single model by parameter averaging. We conduct experiments on eight datasets from the GLUE benchmark with BERT and RoBERTa, and we also merge WizardLM, WizardMath, and Code Alpaca, all based on Llama 2. Experimental results show that: (1) The delta parameter value ranges of SFT models are typically small, often within 0.005, and DARE can easily eliminate 99% of them. However, once models are continuously pre-trained, the value ranges can grow to around 0.03, making DARE impractical. We also tried removing fine-tuned parameters instead of delta parameters and found that a 10% reduction can drastically decrease performance (even to 0). This highlights that SFT merely stimulates abilities via delta parameters rather than injecting new abilities into LMs; (2) DARE can merge multiple task-specific LMs into one LM with diverse abilities. For instance, merging WizardLM and WizardMath improves the GSM8K zero-shot accuracy of WizardLM from 2.2 to 66.3, retaining its instruction-following ability while surpassing WizardMath's original score of 64.2. Code is available at https://github.com/yule-BUAA/MergeLM.
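The drop-and-rescale step and the averaging-based merge described above can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the linked repository for that). The abstract does not state the rescale factor, so the sketch assumes that each delta parameter is dropped independently with probability `drop_rate` and the survivors are divided by `1 - drop_rate`, as in dropout. It also assumes that "parameter averaging" means averaging the sparsified deltas and adding the result back to the pre-trained weights. The function names `dare` and `merge_by_averaging` and the toy tensors are hypothetical.

```python
import numpy as np


def dare(pretrained, finetuned, drop_rate, rng):
    """Drop And REscale on one parameter tensor (assumed form, see lead-in).

    Returns the sparsified delta, not the full fine-tuned weights.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    delta = finetuned - pretrained
    # Keep each delta entry with probability (1 - drop_rate).
    keep_mask = rng.random(delta.shape) >= drop_rate
    # Rescale survivors so the expected value of each entry is unchanged.
    return delta * keep_mask / (1.0 - drop_rate)


def merge_by_averaging(pretrained, finetuned_models, drop_rate, rng):
    """Average the DARE-sparsified deltas and add them to the base weights."""
    sparse_deltas = [dare(pretrained, ft, drop_rate, rng) for ft in finetuned_models]
    return pretrained + np.mean(sparse_deltas, axis=0)


# Toy data: two "fine-tuned" tensors whose deltas lie within 0.005 of the base.
rng = np.random.default_rng(0)
pretrained = rng.normal(size=(1000, 100))
ft_a = pretrained + rng.uniform(-0.005, 0.005, size=pretrained.shape)
ft_b = pretrained + rng.uniform(-0.005, 0.005, size=pretrained.shape)

sparse_delta = dare(pretrained, ft_a, drop_rate=0.9, rng=rng)
merged = merge_by_averaging(pretrained, [ft_a, ft_b], drop_rate=0.9, rng=rng)
```

With a real model, the same operation would be applied to every parameter tensor of models that share one pre-trained backbone. The toy tensors here only exercise the arithmetic and say nothing about downstream accuracy.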