We investigate efficient methods for training Large Language Models (LLMs) to acquire capabilities in multiple specialized domains, such as coding, math reasoning, and world knowledge. Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in an embarrassingly parallel fashion with high throughput and reduced communication cost. After the individual experts are trained asynchronously, BTX brings together their feedforward parameters as experts in Mixture-of-Experts (MoE) layers and averages the remaining parameters, followed by an MoE-finetuning stage to learn token-level routing. BTX generalizes two special cases: the Branch-Train-Merge method, which lacks the MoE-finetuning stage to learn routing, and sparse upcycling, which omits the stage of training experts asynchronously. Compared to alternative approaches, BTX achieves the best accuracy-efficiency tradeoff.
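The merge and routing steps described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `btx_merge` and `moe_forward`, the `ff.` parameter-name prefix, the use of a single linear map as a stand-in for a full feedforward block, and the top-2 softmax gating are all assumptions made for illustration.

```python
import numpy as np


def btx_merge(expert_params, ff_prefix="ff."):
    """Combine per-domain checkpoints into one MoE parameter set.

    expert_params: list of dicts (name -> np.ndarray) with identical keys and
    shapes, as they would be if all were branched from one seed model.
    Names starting with `ff_prefix` (a hypothetical naming convention) are
    treated as feedforward weights.
    """
    merged = {}
    for name in expert_params[0]:
        stacked = np.stack([p[name] for p in expert_params])  # (E, ...)
        if name.startswith(ff_prefix):
            merged[name] = stacked               # keep one copy per expert
        else:
            merged[name] = stacked.mean(axis=0)  # average the remaining parameters
    return merged


def moe_forward(x, w_router, w_experts, top_k=2):
    """Token-level top-k routing (assumed gating: softmax over the top-k logits).

    x:         (T, d) token representations
    w_router:  (d, E) router weights, the part learned during MoE finetuning
    w_experts: (E, d, d) one linear map per expert (stand-in for an FF block)
    """
    logits = x @ w_router                                    # (T, E)
    top = np.argsort(-logits, axis=1)[:, :top_k]             # (T, k) chosen experts
    top_logits = np.take_along_axis(logits, top, axis=1)     # (T, k)
    top_logits = top_logits - top_logits.max(axis=1, keepdims=True)
    gates = np.exp(top_logits)
    gates /= gates.sum(axis=1, keepdims=True)                # gates sum to 1 per token

    out = np.zeros_like(x, dtype=float)
    for slot in range(top_k):
        idx = top[:, slot]                                   # expert id per token
        expert_out = np.einsum("td,tde->te", x, w_experts[idx])
        out += gates[:, slot:slot + 1] * expert_out
    return out
```

In this sketch the router weights are the only parameters with no counterpart in the branched checkpoints, which is why a finetuning stage is needed after merging. Skipping that stage and averaging every parameter corresponds roughly to the Branch-Train-Merge special case, while passing copies of a single checkpoint as every expert corresponds roughly to sparse upcycling.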