The underlying data distributions of natural language, programming code, and mathematical symbols differ vastly, which poses a complex challenge for large language models (LLMs) that strive for high performance across all three domains simultaneously. Bringing an LLM to a very high level of proficiency in a specific domain often requires extensive training on relevant corpora, which typically comes at the cost of performance in other domains. In this paper, we propose to directly fuse models that are already highly specialized. The proposed fusing framework, UltraFuser, consists of three distinct specialists that are already sufficiently trained on language, coding, and mathematics. A token-level gating mechanism is introduced to blend the specialists' outputs, and a two-stage training strategy accompanied by balanced sampling is designed to ensure stability. To effectively train the fused model, we further construct a high-quality supervised instruction tuning dataset, UltraChat 2, which includes text, code, and mathematical content. This dataset comprises approximately 300,000 instructions and covers a wide range of topics in each domain. Experiments show that our model can simultaneously achieve mastery of the three crucial domains.
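To make the idea of token-level gating concrete, the following is a minimal sketch of how the outputs of three specialists might be blended per token. The abstract does not specify the form of the gate, so everything here is an assumption for illustration: the function name `fuse_token_level`, the use of one linear scoring vector per specialist applied to that specialist's hidden state, the softmax normalization across specialists, and the choice to blend logits rather than probabilities. It should not be read as the UltraFuser implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_token_level(logits, hidden, gate_weights):
    """Blend specialist outputs with one gate value per token and specialist.

    logits:       (S, T, V) next-token logits from S specialists
    hidden:       (S, T, H) hidden states of the S specialists
    gate_weights: (S, H)    hypothetical per-specialist scoring vectors
    Returns the fused logits (T, V) and the gate values (T, S).
    """
    # Assumption: each specialist scores its own hidden state at each token.
    scores = np.einsum("sth,sh->ts", hidden, gate_weights)
    # Normalize across specialists so the gates at each token sum to 1.
    gates = softmax(scores, axis=-1)
    # Weighted sum of specialist logits, with weights that vary per token.
    fused = np.einsum("ts,stv->tv", gates, logits)
    return fused, gates


# Toy example with random tensors standing in for real model outputs.
rng = np.random.default_rng(0)
S, T, H, V = 3, 4, 8, 10  # specialists, tokens, hidden size, vocabulary size
logits = rng.normal(size=(S, T, V))
hidden = rng.normal(size=(S, T, H))
gate_weights = rng.normal(size=(S, H))

fused, gates = fuse_token_level(logits, hidden, gate_weights)
```

Because the gate is computed per token, a single response can in principle lean on the coding specialist for one span and the mathematics specialist for another. In this sketch the specialists are assumed to share a vocabulary of size `V`, which is what makes a direct weighted sum of their logits well defined.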