FuseChat: Knowledge Fusion of Chat Models

Training large language models (LLMs) from scratch can yield models with distinct capabilities and strengths, but it is costly and can produce redundant competencies. An alternative is to combine existing LLMs into a more robust LLM, reducing the need for expensive pre-training. However, because LLMs have diverse architectures, direct parameter blending is infeasible. Recently, \textsc{FuseLLM} introduced knowledge fusion, which transfers the collective knowledge of multiple structurally varied LLMs into a target LLM through lightweight continual training. In this report, we extend the scalability and flexibility of the \textsc{FuseLLM} framework to fuse chat LLMs, resulting in \textsc{FuseChat}. \textsc{FuseChat} comprises two main stages. First, we perform knowledge fusion on source LLMs of varied structure and scale to derive multiple target LLMs of identical structure and size via lightweight fine-tuning. Second, we merge these target LLMs in the parameter space, proposing a novel method that determines the merging weights from the variation ratio of parameter matrices before and after fine-tuning. We validate our approach using three prominent chat LLMs with diverse architectures and scales: \texttt{NH2-Mixtral-8x7B}, \texttt{NH2-Solar-10.7B}, and \texttt{OpenChat-3.5-7B}. Experimental results spanning various chat domains demonstrate the superiority of \texttt{\textsc{FuseChat}-7B} over a broad spectrum of chat LLMs at 7B and 34B scales, even surpassing \texttt{GPT-3.5 (March)} and approaching \texttt{Mixtral-8x7B-Instruct}. Our code, model weights, and data are openly accessible at \url{https://github.com/fanqiwan/FuseLLM}.
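
To make the second stage more concrete, the following is a minimal sketch of merging same-shaped target models with weights derived from how much each parameter matrix changed during fine-tuning. It is not the authors' implementation. It assumes that the variation of a matrix is measured as the mean squared element-wise change relative to a shared pre-fine-tuning base, that weights are normalized across target models separately for each matrix, and that matrices are small nested lists. The names `mean_squared_change`, `variation_ratio_weights`, and `merge_matrix` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def mean_squared_change(before: Matrix, after: Matrix) -> float:
    """Average squared element-wise difference between two same-shaped matrices."""
    total = 0.0
    count = 0
    for row_before, row_after in zip(before, after):
        for b, a in zip(row_before, row_after):
            total += (a - b) ** 2
            count += 1
    return total / count


def variation_ratio_weights(base: Matrix, targets: List[Matrix]) -> List[float]:
    """Assumed weighting rule: each target's share of the total change from the base."""
    changes = [mean_squared_change(base, target) for target in targets]
    total = sum(changes)
    if total == 0.0:
        # No target changed this matrix, so fall back to a uniform average.
        return [1.0 / len(targets)] * len(targets)
    return [change / total for change in changes]


def merge_matrix(base: Matrix, targets: List[Matrix]) -> Matrix:
    """Weighted average of the fine-tuned matrices using the weights above."""
    weights = variation_ratio_weights(base, targets)
    n_rows, n_cols = len(base), len(base[0])
    return [
        [
            sum(w * target[i][j] for w, target in zip(weights, targets))
            for j in range(n_cols)
        ]
        for i in range(n_rows)
    ]


# Toy example: two fine-tuned copies of one 2x2 matrix from a shared base.
base = [[0.0, 0.0], [0.0, 0.0]]
target_a = [[1.0, 1.0], [1.0, 1.0]]  # small change during fine-tuning
target_b = [[3.0, 3.0], [3.0, 3.0]]  # larger change during fine-tuning
weights = variation_ratio_weights(base, [target_a, target_b])
merged = merge_matrix(base, [target_a, target_b])
```

Under these assumptions, the matrix that moved further from the base receives the larger merging weight. In a full model the same computation would be repeated for every parameter matrix, so that different matrices can favor different target models.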