Mixture-of-experts (MoE) models facilitate efficient scaling; however, training the router network introduces the challenge of optimizing a non-differentiable, discrete objective. Recently, a fully-differentiable MoE architecture, SMEAR, was proposed (Muqeeth et al., 2023), which softly merges experts in the parameter space; nevertheless, its effectiveness was only demonstrated in downstream fine-tuning on classification tasks. In this paper, we present Lory, the first approach that scales such architectures to autoregressive language model pre-training. Lory introduces two key techniques: (1) a causal segment routing strategy that achieves high efficiency for expert merging operations while preserving the autoregressive nature of language models; (2) a similarity-based data batching method that encourages expert specialization by grouping similar documents in training instances. We pre-train a series of Lory models on 150B tokens from scratch, with up to 32 experts and 30B (1.5B active) parameters. Experimental results show significant performance gains over parameter-matched dense models on both perplexity (+13.9%) and a variety of downstream tasks (+1.5%-11.1%). Despite segment-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing. We further demonstrate that the trained experts in Lory capture domain-level specialization without supervision. Our work highlights the potential of fully-differentiable MoE architectures for language model pre-training and advocates future research in this area.
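The two mechanisms named in the abstract, soft merging of experts in parameter space and causal segment routing, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and the abstract does not specify these details. The function name `merged_expert_forward`, the use of a single linear map per expert (real experts would typically be feed-forward blocks), routing on the mean hidden state of the previous segment, and uniform weights for the first segment are all assumptions made here for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def merged_expert_forward(h, router_w, expert_w, segment_len):
    """Hypothetical sketch of segment-level soft expert merging.

    h:         (T, d) hidden states for one sequence
    router_w:  (d, E) router projection (assumed linear)
    expert_w:  (E, d, d_out) one linear map per expert (a simplification)
    Returns the outputs (T, d_out) and the per-segment routing weights.
    """
    T, _ = h.shape
    num_experts, _, d_out = expert_w.shape
    out = np.empty((T, d_out))
    all_weights = []
    for start in range(0, T, segment_len):
        end = min(start + segment_len, T)
        if start == 0:
            # Assumption: no earlier context exists, so merge uniformly.
            w = np.full(num_experts, 1.0 / num_experts)
        else:
            # Route on the PREVIOUS segment only, so that no token's
            # output depends on tokens that come after it.
            prev_mean = h[start - segment_len:start].mean(axis=0)
            w = softmax(prev_mean @ router_w)
        # Soft merge in parameter space: a weighted average of expert
        # weights, computed once per segment rather than once per token.
        merged = np.tensordot(w, expert_w, axes=1)  # (d, d_out)
        out[start:end] = h[start:end] @ merged
        all_weights.append(w)
    return out, np.stack(all_weights)
```

Because the merge is a softmax-weighted average rather than a discrete top-k selection, gradients can flow to the router through ordinary backpropagation in an autodiff framework. Merging once per segment rather than per token is what keeps the cost of building merged parameters manageable in this sketch.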