Natural language, programming code, and mathematical symbols follow vastly different underlying data distributions, which poses a complex challenge for large language models (LLMs) that aim to perform well across all three domains simultaneously. Reaching a very high level of proficiency in one domain usually requires extensive training on relevant corpora, and this typically comes at the cost of performance in the other domains. In this paper, we propose to directly fuse models that are already highly specialized. The proposed fusing framework, UltraFuser, consists of three distinct specialists that are already sufficiently trained on language, coding, and mathematics, respectively. A token-level gating mechanism blends the specialists' outputs, and a two-stage training strategy with balanced sampling is designed to ensure stable training. To train the fused model effectively, we further construct a high-quality supervised instruction tuning dataset, UltraChat 2, which includes text, code, and mathematical content. This dataset comprises approximately 300,000 instructions and covers a wide range of topics in each domain. Experiments show that our model can simultaneously achieve mastery of the three crucial domains.
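To make the idea of token-level gating concrete, the following is a minimal sketch and not the UltraFuser implementation. The abstract does not state whether the gate acts on logits or hidden states, nor how the gate scores are produced, so this sketch assumes that the three specialists share a vocabulary, that each emits per-token logits, and that a separate gate supplies one unnormalized score per specialist per token. The names `fuse_token_logits`, `specialist_logits`, and `gate_scores` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def fuse_token_logits(specialist_logits, gate_scores):
    """Blend specialist outputs with per-token gate weights (illustrative assumption).

    specialist_logits: array of shape (n_specialists, seq_len, vocab_size)
    gate_scores:       array of shape (seq_len, n_specialists), unnormalized
    Returns (fused_logits, weights) with shapes
    (seq_len, vocab_size) and (seq_len, n_specialists).
    """
    # Normalize the gate scores so each token's weights sum to 1.
    weights = softmax(gate_scores, axis=-1)
    # Weighted sum over the specialist axis, computed independently per token.
    fused = np.einsum("sn,nsv->sv", weights, specialist_logits)
    return fused, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three specialists (language, code, math), 4 tokens, toy vocabulary of 10.
    logits = rng.normal(size=(3, 4, 10))
    scores = rng.normal(size=(4, 3))
    fused, weights = fuse_token_logits(logits, scores)
    print(fused.shape, weights.sum(axis=-1))
```

Because the weights are computed per token, the blend can lean on a different specialist at each position of the same sequence, for example on the code specialist inside a code span and on the language specialist in the surrounding prose.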