Mixture-of-Experts (MoE) layers have emerged as an important tool for scaling up modern neural networks by decoupling the total number of trainable parameters from the number of parameters activated in the forward pass for each token. However, sparse MoEs add complexity to training because of (i) new trainable parameters (router weights) that, like all other parameter groups, require hyperparameter (HP) tuning; and (ii) new architectural scale dimensions (the number and hidden size of experts) that must be chosen and potentially made large. To make HP selection cheap and reliable, we propose a new parameterization for transformer models with MoE layers when scaling model width, depth, number of experts, and expert (hidden) size. Our parameterization is justified by a novel dynamical mean-field theory (DMFT) analysis. When varying different model dimensions at a fixed training token budget, we find empirically that our parameterization enables reliable HP transfer across models from 51M to over 2B total parameters. We further take HPs identified by sweeping small models over a short token horizon to train larger models on longer horizons, and the resulting models perform well.
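To make the idea of a width-aware parameterization concrete, here is a minimal sketch of how per-layer initialization scales and learning-rate multipliers might be derived from width and expert-size multipliers, in the general style of μP-like schemes. This is an illustrative assumption, not the paper's actual rules: the base widths, the 1/√(fan-in) initialization, the 1/width LR multipliers, and treating the router as an ordinary hidden weight are all choices made here for the sketch.

```python
import math

def parameterization_scales(d_model, d_expert, base_d_model=256, base_d_expert=256):
    """Illustrative muP-style width scaling for an MoE transformer.

    Assumed (not from the paper): hidden-weight init std ~ 1/sqrt(fan_in),
    hidden-layer learning-rate multiplier ~ 1/(width multiplier), and the
    router treated like any other hidden weight with fan-in d_model.
    """
    m_width = d_model / base_d_model    # width multiplier relative to the base model
    m_expert = d_expert / base_d_expert # expert hidden-size multiplier
    return {
        # dense / attention hidden weights: init shrinks with sqrt(fan_in)
        "hidden_init_std": 1.0 / math.sqrt(d_model),
        "hidden_lr_mult": 1.0 / m_width,
        # expert hidden weights: fan-in of the expert output projection is d_expert
        "expert_init_std": 1.0 / math.sqrt(d_expert),
        "expert_lr_mult": 1.0 / m_expert,
        # router (d_model -> n_experts): scaled like a hidden weight here (assumption)
        "router_init_std": 1.0 / math.sqrt(d_model),
        "router_lr_mult": 1.0 / m_width,
    }
```

Under such a scheme, an HP sweep on the base model (here 256-wide) would yield a base learning rate that is then rescaled per parameter group when the model is widened, which is the mechanism that makes HP transfer across widths possible.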