Training large language models (LLMs) from scratch can yield models with distinct capabilities and strengths, but it is costly and may produce redundant competencies. An alternative strategy is to combine existing LLMs into a more robust LLM, thereby reducing the need for expensive pre-training. However, because LLMs differ in architecture, directly blending their parameters is infeasible. Recently, \textsc{FuseLLM} introduced the concept of knowledge fusion to transfer the collective knowledge of multiple structurally varied LLMs into a target LLM through lightweight continual training. In this report, we extend the scalability and flexibility of the \textsc{FuseLLM} framework to realize the fusion of chat LLMs, resulting in \textsc{FuseChat}. \textsc{FuseChat} comprises two main stages. First, we perform knowledge fusion on source LLMs of varying structure and scale to derive multiple target LLMs of identical structure and size via lightweight fine-tuning. Second, we merge these target LLMs in parameter space, for which we propose a novel method that determines the merging weights from the variation ratio of the parameter matrices before and after fine-tuning. We validate our approach using three prominent chat LLMs with diverse architectures and scales, namely \texttt{NH2-Mixtral-8x7B}, \texttt{NH2-Solar-10.7B}, and \texttt{OpenChat-3.5-7B}. Experimental results spanning various chat domains show that \texttt{\textsc{FuseChat}-7B} outperforms a broad spectrum of chat LLMs at the 7B and 34B scales, even surpassing \texttt{GPT-3.5 (March)} and approaching \texttt{Mixtral-8x7B-Instruct}. Our code, model weights, and data are openly accessible at \url{https://github.com/fanqiwan/FuseLLM}.
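The second stage, merging target LLMs with weights derived from how much each parameter matrix changed during fine-tuning, can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything specific here is an assumption rather than the released implementation: the variation measure (mean squared change per matrix), the normalization across models, the uniform fallback, and the helper names `variation_ratio_weights` and `merge`.

```python
import numpy as np

def variation_ratio_weights(base, tuned_models, eps=1e-12):
    """Per-matrix merging weights from how much each model changed.

    base: dict mapping parameter name -> ndarray (before fine-tuning)
    tuned_models: list of dicts with the same names and shapes
    Returns a dict mapping name -> 1-D array of weights summing to 1.
    """
    weights = {}
    for name, w0 in base.items():
        # Assumed variation measure: mean squared change of the matrix.
        change = np.array([np.mean((m[name] - w0) ** 2) for m in tuned_models])
        total = change.sum()
        if total < eps:
            # No model changed this matrix: fall back to a uniform average.
            weights[name] = np.full(len(tuned_models), 1.0 / len(tuned_models))
        else:
            weights[name] = change / total
    return weights

def merge(base, tuned_models):
    """Weighted average of the fine-tuned matrices, one weight set per matrix."""
    weights = variation_ratio_weights(base, tuned_models)
    merged = {}
    for name in base:
        stacked = np.stack([m[name] for m in tuned_models])  # (k, ...) array
        merged[name] = np.tensordot(weights[name], stacked, axes=1)
    return merged
```

Under this assumption, a target model whose matrix moved further from the shared starting point contributes more to that matrix in the merged model. The weights are computed per matrix rather than once per model.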