Mixture of Experts (MoE) facilitates the development of large models because the computational cost of the model no longer scales linearly with the number of parameters. A learned sparse gating network selects a set of experts for each token; however, the number of tokens processed by each expert can vary across successive iterations. These expert load fluctuations reduce computational parallelism and resource utilization. To study this, we traced and analyzed the load of each expert across the training iterations of several large language models, and defined a transient state, marked by "obvious load fluctuation", and a stable state, marked by "temporal locality". Given the characteristics of these two states and the computational overhead, we deployed three classical prediction algorithms that predict expert load accurately. For the GPT3 350M model, the average error rates for predicting the expert load proportion over the next 1,000 and 2,000 steps are approximately 1.3% and 1.8%, respectively. This work can provide valuable guidance for expert placement or resource allocation in MoE model training. Building on it, we will propose an expert placement scheme for the transient and stable states in future work.
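To make "expert load proportion" and "load prediction" concrete, the following is a minimal sketch, not the authors' method. The abstract does not name the three classical prediction algorithms, so a sliding-window mean is used here purely as a stand-in, and the error measure (mean absolute difference between predicted and observed proportions) is likewise an assumption. All names (`load_proportions`, `SlidingWindowPredictor`, `mean_abs_error`) are hypothetical.

```python
from collections import Counter, deque
from typing import Deque, List, Sequence


def load_proportions(assignments: Sequence[Sequence[int]], num_experts: int) -> List[float]:
    """Share of token-to-expert assignments each expert receives in one step.

    assignments[t] holds the expert indices the gate chose for token t
    (top-k routing, so each token may list several experts).
    """
    counts = Counter(expert for token in assignments for expert in token)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * num_experts
    return [counts.get(e, 0) / total for e in range(num_experts)]


class SlidingWindowPredictor:
    """Stand-in predictor (assumption): mean of the last `window` observed steps."""

    def __init__(self, num_experts: int, window: int) -> None:
        self.num_experts = num_experts
        # deque with maxlen drops the oldest step automatically.
        self.history: Deque[List[float]] = deque(maxlen=window)

    def update(self, proportions: Sequence[float]) -> None:
        """Record the load proportions observed at the current step."""
        self.history.append(list(proportions))

    def predict(self) -> List[float]:
        """Predict upcoming load proportions; uniform if nothing was observed yet."""
        if not self.history:
            return [1.0 / self.num_experts] * self.num_experts
        n = len(self.history)
        return [sum(step[e] for step in self.history) / n
                for e in range(self.num_experts)]


def mean_abs_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Assumed error measure: mean absolute difference across experts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
```

A window-based predictor of this kind relies on the temporal locality that the abstract attributes to the stable state; in the transient state, where loads fluctuate visibly, a short window or a different predictor would presumably be needed.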