Prevailing LLM serving engines employ expert parallelism (EP) for multi-device inference of massive MoE models. However, the efficiency of expert-parallel inference is largely bounded by inter-device communication, as EP relies on expensive all-to-all collectives to route tokens to remote experts whenever those experts are not collocated on the same GPU/NPU device. Moreover, state-of-the-art schemes treat expert device placement and request (or token) device scheduling as separate concerns, triggering excessive inter-device communication and compromising inference efficiency. This paper proposes Semantic Parallelism, a novel parallelism paradigm that minimizes the steep communication costs of EP-centric MoE serving via model-data collaborative scheduling. We implement Semantic Parallelism in a framework called Sem-MoE. Sem-MoE maximally collocates experts and their activating tokens on the same device by proactively modeling the activation likelihood between them, and introduces three key techniques: (1) offline model scheduling, which preliminarily clusters and collocates experts onto devices based on their co-activation tendencies for certain classes of input; (2) online inter-request data scheduling for Attention-DP setups, which proactively rebatches incoming requests onto the device hosting the experts most likely and most frequently activated by those requests; and (3) online intra-request data scheduling for Attention-TP setups, which seamlessly fuses a token-reshuffling procedure into the original inference pipeline and proactively reschedules tokens across devices to reduce dispersed remote routing. We build Sem-MoE into the prevailing LLM serving engine SGLANG. Experiments show that our collaborative scheduling approach effectively reduces the all-to-all communication volume of EP and achieves superior inference throughput compared to existing solutions.
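To make the offline model-scheduling idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual algorithm): given a matrix of pairwise expert co-activation counts, greedily partition experts into equal-sized device groups so that frequently co-activated experts land on the same device. The function name `place_experts` and the greedy seeding strategy are illustrative assumptions.

```python
def place_experts(coact, num_devices):
    """Greedy balanced grouping of experts by co-activation (illustrative only).

    coact[i][j]: how often experts i and j are activated by the same input.
    Returns num_devices groups of (near-)equal size.
    """
    n = len(coact)
    group_size = (n + num_devices - 1) // num_devices
    unassigned = set(range(n))
    groups = []
    while unassigned:
        # Seed each group with the expert having the highest total
        # co-activation among the remaining experts (ties -> lowest index).
        seed = max(sorted(unassigned),
                   key=lambda e: sum(coact[e][j] for j in unassigned))
        group = [seed]
        unassigned.remove(seed)
        while len(group) < group_size and unassigned:
            # Grow the group with the expert most co-activated with it.
            nxt = max(sorted(unassigned),
                      key=lambda e: sum(coact[g][e] for g in group))
            group.append(nxt)
            unassigned.remove(nxt)
        groups.append(group)
    return groups

# Toy example: experts 0/1 and 2/3 tend to co-activate,
# so a 2-device placement should keep each pair together.
coact = [
    [0, 9, 1, 0],
    [9, 0, 0, 1],
    [1, 0, 0, 8],
    [0, 1, 8, 0],
]
print(place_experts(coact, 2))  # → [[0, 1], [2, 3]]
```

In this sketch, tokens that activate experts 0 and 1 would route entirely within one device, avoiding the all-to-all hop that a naive round-robin placement (e.g. experts {0, 2} on one device) would incur.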