Mixture-of-Experts (MoE) models enable efficient inference by employing smaller experts and activating only a subset of them per token. MoE serving engines distribute experts across multiple GPUs and, at inference time, route each token to the GPUs that host its activated experts. They process tokens in lock-step: all tokens in a batch must finish a layer before any proceeds to the next. This synchronization barrier is a critical bottleneck because the performance of an MoE model is limited by the straggler GPU that finishes last. Stragglers emerge when too many heavily used experts are placed on the same GPU or on the slowest GPU. Although prior works place experts to balance token loads across GPUs, they all overlook GPU variability and often place highly used experts on the slowest GPUs. We propose GEM (GPU-variability-aware Expert Mapping), a framework that maps experts to GPUs for MoE models while accounting for GPU variability. GEM exploits two insights. First, experts must be placed so that each GPU receives a non-uniform token load matched to its variability, and all GPUs finish processing a layer at about the same time. Second, our studies show that there are two types of experts: consistent experts, which are used most of the time, and temporal experts, which are often used together for the remaining time. Simultaneously used consistent and temporal experts must therefore be placed on different GPUs and kept off slower GPUs to reduce slowdown. GEM gathers the variability profile of the GPUs for each model and task and uses the per-task token load distributions to map experts to GPUs. Our experiments show that GEM improves end-to-end latency by 7.9% on average and by up to 16.5% compared to the baseline.
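To make the first insight concrete, the following is a minimal sketch of variability-aware placement, not GEM's actual algorithm. It assumes a known per-expert token load, a relative speed factor per GPU (1.0 being the fastest), and a fixed number of expert slots per GPU. The names `map_experts`, `finish_times`, `expert_load`, `gpu_speed`, and `slots_per_gpu` are all hypothetical. The heuristic is a simple greedy one: it places the heaviest experts first, each on the GPU with the earliest estimated finish time, so slower GPUs tend to receive fewer tokens.

```python
from typing import Dict, List


def map_experts(expert_load: List[float],
                gpu_speed: List[float],
                slots_per_gpu: int) -> Dict[int, List[int]]:
    """Greedy sketch (hypothetical, not the paper's method).

    Experts are visited from heaviest to lightest. Each one goes to the
    GPU that still has a free slot and whose estimated finish time
    (assigned tokens / relative speed) would be smallest afterwards.
    """
    num_gpus = len(gpu_speed)
    if len(expert_load) > num_gpus * slots_per_gpu:
        raise ValueError("not enough expert slots for all experts")

    placement: Dict[int, List[int]] = {g: [] for g in range(num_gpus)}
    assigned = [0.0] * num_gpus  # tokens currently mapped to each GPU

    # Heaviest experts first, so large loads are spread out early.
    order = sorted(range(len(expert_load)),
                   key=lambda e: expert_load[e], reverse=True)
    for e in order:
        candidates = [g for g in range(num_gpus)
                      if len(placement[g]) < slots_per_gpu]
        best = min(candidates,
                   key=lambda g: (assigned[g] + expert_load[e]) / gpu_speed[g])
        placement[best].append(e)
        assigned[best] += expert_load[e]
    return placement


def finish_times(placement: Dict[int, List[int]],
                 expert_load: List[float],
                 gpu_speed: List[float]) -> List[float]:
    """Estimated per-GPU time for one layer; the maximum is the straggler."""
    return [sum(expert_load[e] for e in placement[g]) / gpu_speed[g]
            for g in sorted(placement)]


if __name__ == "__main__":
    # Illustrative numbers only: 8 experts, 4 GPUs, two of them 20% slower.
    loads = [40.0, 30.0, 20.0, 10.0, 10.0, 10.0, 5.0, 5.0]
    speeds = [1.0, 1.0, 0.8, 0.8]
    mapping = map_experts(loads, speeds, slots_per_gpu=2)
    print(mapping)
    print(finish_times(mapping, loads, speeds))
```

This sketch covers only load balancing under GPU variability. It does not model the second insight, which would also require co-activation information to keep simultaneously used consistent and temporal experts on different GPUs.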