Large Language Models (LLMs) have revolutionized the field of artificial intelligence, with their capabilities expanding rapidly due to advances in deep learning and increased computational resources. The Mixture-of-Experts (MoE) model has emerged as a prominent architecture for LLMs, offering a better balance between model performance and computational efficiency. The MoE architecture enables effective scaling and efficient parallel processing, but its GEMM (General Matrix Multiply) operations and large parameter count introduce challenges in computation efficiency and communication overhead, which become the throughput bottleneck during inference. Applying a single parallelism strategy such as expert parallelism (EP), data parallelism (DP), or pipeline parallelism (PP) to the MoE architecture usually yields sub-optimal inference throughput, and straightforward combinations of these existing parallelisms still fall short of optimal throughput. This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that goes beyond existing inference parallelism schemes. Our approach optimizes the computation of the MoE FFN (FeedForward Network) modules by dynamically selecting the better kernel implementation between GroupGemm and DenseGemm for different loads and adaptively overlapping these computations with \textit{all2all} communication, leading to a substantial increase in throughput. Our experimental results demonstrate an average 21% improvement in prefill throughput over existing parallel inference methods. Specifically, we validated our method on DeepSeekV2, a highly optimized model reported to achieve a prefill throughput of 100K tokens per second; by applying EPS-MoE, we further accelerated it to at least 120K tokens per second.
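To make the overlap idea concrete, the sketch below illustrates (in PyTorch, not the authors' implementation) how per-chunk \textit{all2all} token dispatch can be issued on a separate CUDA stream so that the expert FFN GEMMs of chunk $i$ run concurrently with the communication of chunk $i{+}1$. The function name \texttt{pipelined\_moe\_ffn}, the chunking scheme, and the \texttt{expert\_ffn} callable are illustrative assumptions; kernel selection between GroupGemm and DenseGemm is only indicated by a comment.

```python
# Minimal sketch of computation/communication overlap for an MoE FFN layer,
# assuming an initialized torch.distributed NCCL process group and CUDA tensors.
# Not the EPS-MoE implementation; names and chunking are illustrative.
import torch
import torch.distributed as dist

def pipelined_moe_ffn(hidden_chunks, expert_ffn, group):
    """Overlap the all2all dispatch of chunk i+1 with the FFN compute of chunk i."""
    comm_stream = torch.cuda.Stream()
    recv_chunks, handles, outputs = [], [], []

    # Pre-issue the first all2all on the communication stream.
    with torch.cuda.stream(comm_stream):
        recv = torch.empty_like(hidden_chunks[0])
        handles.append(dist.all_to_all_single(recv, hidden_chunks[0],
                                              group=group, async_op=True))
        recv_chunks.append(recv)

    for i in range(len(hidden_chunks)):
        # Launch the next chunk's all2all so it overlaps with this chunk's GEMMs.
        if i + 1 < len(hidden_chunks):
            with torch.cuda.stream(comm_stream):
                recv = torch.empty_like(hidden_chunks[i + 1])
                handles.append(dist.all_to_all_single(recv, hidden_chunks[i + 1],
                                                      group=group, async_op=True))
                recv_chunks.append(recv)
        # Wait only for chunk i's tokens, then run the expert FFN on them.
        # A load-aware scheduler could pick GroupGemm vs. DenseGemm kernels here.
        handles[i].wait()
        outputs.append(expert_ffn(recv_chunks[i]))
    return torch.cat(outputs, dim=0)
```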