The underlying data distributions of natural language, programming code, and mathematical symbols differ substantially, which poses a challenge for large language models (LLMs) that aim to perform well across all three domains simultaneously. Reaching high proficiency in a specific domain often requires extensive training on relevant corpora, which typically comes at the cost of performance in other domains. In this paper, we propose to directly fuse models that are already highly specialized. The proposed fusing framework, UltraFuser, consists of three distinct specialists that are already sufficiently trained on language, coding, and mathematics, respectively. A token-level gating mechanism is introduced to blend the specialists' outputs. A two-stage training strategy accompanied by balanced sampling is designed to ensure stability. To effectively train the fused model, we further construct a high-quality supervised instruction tuning dataset, UltraChat 2, which includes text, code, and mathematical content. This dataset comprises approximately 300,000 instructions and covers a wide range of topics in each domain. Experiments show that our model can achieve mastery of the three crucial domains simultaneously.
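To make the token-level gating idea concrete, the following is a minimal sketch and not the UltraFuser implementation. The abstract does not specify the form of the gate, so the sketch assumes that, at each token position, a gate emits one raw score per specialist, that the scores are normalized with a softmax, and that the fused output is the weighted sum of the specialists' logits. The names `fuse_token_logits` and `fuse_sequence` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_token_logits(specialist_logits, gate_scores):
    """Blend the logits of several specialists at one token position.

    specialist_logits: list of K lists, each of vocabulary length V.
    gate_scores: list of K raw gate scores for this token (assumed form).
    Returns (fused_logits, gate_weights).
    """
    weights = softmax(gate_scores)
    vocab_size = len(specialist_logits[0])
    fused = [0.0] * vocab_size
    for weight, logits in zip(weights, specialist_logits):
        for i, value in enumerate(logits):
            fused[i] += weight * value
    return fused, weights


def fuse_sequence(per_token_logits, per_token_gate_scores):
    """Apply the gate independently at every token position."""
    return [
        fuse_token_logits(logits, scores)[0]
        for logits, scores in zip(per_token_logits, per_token_gate_scores)
    ]


if __name__ == "__main__":
    # Three hypothetical specialists (language, code, math), vocabulary of 2.
    logits = [[3.0, 0.0], [0.0, 3.0], [0.0, 0.0]]
    fused, weights = fuse_token_logits(logits, gate_scores=[0.0, 0.0, 0.0])
    print(weights, fused)
```

Because the weights are computed per token, the mixture can shift between specialists within a single response, for example leaning on the code specialist inside a code span and on the language specialist in the surrounding explanation.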