Mixture-of-Experts (MoE) architectures excel in large language models (LLMs), demonstrating strong performance across a wide range of natural language processing tasks. However, existing methods for transforming LLMs from dense to MoE models face significant data requirements and typically rely on large-scale post-training. In this paper, we propose Upcycling Instruction Tuning (UpIT), a data-efficient approach for tuning a dense pre-trained model into an MoE instruction model. Specifically, we first point out that intermediate checkpoints saved during instruction tuning of the dense model are naturally suited to serve as specialized experts, and then propose an expert expansion stage that flexibly yields models with arbitrary numbers of experts, in which a genetic algorithm and parameter merging are introduced to ensure sufficient diversity among the newly expanded experts. To ensure that each specialized expert in the MoE model works as expected, we select a small amount of seed data at which each expert excels and use it to pre-optimize the router. Extensive experiments across various data scales and upcycling settings demonstrate the outstanding performance and data efficiency of UpIT, as well as its stable improvement when scaling experts or data. Further analysis reveals the importance of ensuring expert diversity during upcycling.
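To make the expert-expansion idea concrete, below is a minimal, hypothetical sketch of how new experts could be derived from a handful of dense instruction-tuning checkpoints via genetic-style crossover and parameter merging. The function name `expand_experts`, the mutation parameter `noise_std`, and the exact genetic operators are illustrative assumptions, not the paper's actual implementation.

```python
import copy
import random
import torch

def expand_experts(checkpoint_ffns, num_experts, noise_std=0.01, seed=0):
    """Derive `num_experts` diverse expert FFNs from dense checkpoints.

    checkpoint_ffns: list of state dicts (name -> tensor), one per
    intermediate checkpoint's feed-forward block. The first experts are
    the checkpoints themselves; the rest come from a genetic-style
    crossover (random interpolation of two parents) plus Gaussian
    mutation, so new experts do not collapse onto existing ones.
    """
    rng = random.Random(seed)
    experts = [copy.deepcopy(sd) for sd in checkpoint_ffns]
    while len(experts) < num_experts:
        parent_a, parent_b = rng.sample(experts, 2)  # select two parents
        alpha = rng.random()                         # crossover coefficient
        child = {}
        for name, weight_a in parent_a.items():
            # Parameter merging: interpolate the two parents' weights.
            merged = alpha * weight_a + (1.0 - alpha) * parent_b[name]
            # Mutation: small Gaussian noise keeps the child distinct.
            child[name] = merged + noise_std * torch.randn_like(merged)
        experts.append(child)
    return experts
```

Interpolating between checkpoints keeps each new expert in the neighborhood of the tuned parameters, while the noise term preserves the diversity the abstract emphasizes as essential for upcycling.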