Mixture-of-experts (MoE) models facilitate efficient scaling; however, training the router network introduces the challenge of optimizing a non-differentiable, discrete objective. Recently, a fully-differentiable MoE architecture, SMEAR, was proposed (Muqeeth et al., 2023), which softly merges experts in the parameter space; nevertheless, its effectiveness was only demonstrated in downstream fine-tuning on classification tasks. In this paper, we present Lory, the first approach that scales such architectures to autoregressive language model pre-training. Lory introduces two key techniques: (1) a causal segment routing strategy that achieves high efficiency for expert merging operations while preserving the autoregressive nature of language models; (2) a similarity-based data batching method that encourages expert specialization by grouping similar documents in training instances. We pre-train a series of Lory models on 150B tokens from scratch, with up to 32 experts and 30B (1.5B active) parameters. Experimental results show significant performance gains over parameter-matched dense models on both perplexity (+13.9%) and a variety of downstream tasks (+1.5%–11.1%). Despite its segment-level routing, Lory achieves competitive performance compared to state-of-the-art MoE models with token-level routing. We further demonstrate that the trained experts in Lory capture domain-level specialization without supervision. Our work highlights the potential of fully-differentiable MoE architectures for language model pre-training and advocates for future research in this area.
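To make the core mechanism concrete, the following is a minimal PyTorch sketch of SMEAR-style soft expert merging combined with causal segment routing, the two ideas named above. All names (`SoftMergedFFN`, `segment_len`), the mean-pooled router input, the first-segment fallback, and the GELU activation are illustrative assumptions for this sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftMergedFFN(nn.Module):
    """Sketch of a fully-differentiable MoE FFN (SMEAR-style merging).

    Instead of routing tokens to discrete experts, the router produces a
    soft distribution over experts, and a single "merged" FFN is built as
    the weighted average of the expert parameters. Causal segment routing:
    the gate for segment i is computed from segment i-1, so the merge for
    a segment never depends on that segment's own (future) tokens.
    """

    def __init__(self, d_model: int, d_ff: int, n_experts: int, segment_len: int):
        super().__init__()
        self.segment_len = segment_len
        self.router = nn.Linear(d_model, n_experts)
        # Expert FFN weights stacked along a leading expert dimension.
        self.w1 = nn.Parameter(torch.randn(n_experts, d_model, d_ff) * 0.02)
        self.w2 = nn.Parameter(torch.randn(n_experts, d_ff, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); seq_len must be divisible by segment_len.
        B, T, D = x.shape
        S = self.segment_len
        segs = x.view(B, T // S, S, D)  # (B, n_seg, S, D)

        # Causal segment routing: gate segment i on the mean hidden state
        # of segment i-1. The first segment reuses its own mean here, a
        # simplifying assumption of this sketch.
        seg_mean = segs.mean(dim=2)  # (B, n_seg, D)
        prev = torch.cat([seg_mean[:, :1], seg_mean[:, :-1]], dim=1)
        gates = F.softmax(self.router(prev), dim=-1)  # (B, n_seg, n_experts)

        # Soft merging in parameter space: one merged FFN per segment,
        # built as the gate-weighted average of the expert weights.
        mw1 = torch.einsum("bne,edf->bndf", gates, self.w1)  # (B, n_seg, D, d_ff)
        mw2 = torch.einsum("bne,efd->bnfd", gates, self.w2)  # (B, n_seg, d_ff, D)

        h = F.gelu(torch.einsum("bnsd,bndf->bnsf", segs, mw1))
        out = torch.einsum("bnsf,bnfd->bnsd", h, mw2)
        return out.reshape(B, T, D)
```

Because the merged FFN is a convex combination of expert parameters, gradients flow back to the router through the merged weights, which is what makes the architecture fully differentiable, and only one merged FFN is applied per segment rather than per token, which is what makes segment-level merging efficient.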
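The similarity-based batching idea can likewise be sketched as a greedy nearest-neighbor grouping over document embeddings, so that documents concatenated into one training instance are topically related. The function name and the simple chaining heuristic are assumptions for illustration, not the paper's actual data pipeline.

```python
import numpy as np


def similarity_batches(doc_embs: np.ndarray, docs_per_instance: int):
    """Greedily chain each document to its nearest unused neighbor
    (by cosine similarity) to form groups of related documents that
    can be concatenated into a single training instance."""
    # Normalize so that a dot product equals cosine similarity.
    norms = np.linalg.norm(doc_embs, axis=1, keepdims=True) + 1e-8
    embs = doc_embs / norms
    unused = set(range(len(embs)))
    instances = []
    while unused:
        cur = unused.pop()  # start a new instance from any unused document
        group = [cur]
        while unused and len(group) < docs_per_instance:
            cand = list(unused)
            sims = embs[cand] @ embs[cur]     # similarity to current document
            cur = cand[int(np.argmax(sims))]  # chain to the nearest neighbor
            unused.remove(cur)
            group.append(cur)
        instances.append(group)
    return instances


# Example: group 1,000 documents into instances of 8 related documents each.
groups = similarity_batches(np.random.randn(1000, 128), docs_per_instance=8)
```

Grouping similar documents means consecutive segments within an instance share a domain, so the segment-level router repeatedly sees coherent context, which is the property the abstract credits with encouraging expert specialization.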