Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference

Xun Sun, Shaoyuan Chen, Pingchuan Ma, Yue Chen, Ziwei Yuan, Zhanhao Cao, Han Han, Shangming Cai, Teng Ma, Xuchun Shang, Xinpeng Zhao, Ke Yang, Junlin Wei, Lianzhi Lin, Yuji Liu, Feng Ren, Haoran Hu, Cheng Wan, Yingdi Shan, Yongwei Wu, Mingxing Zhang

Mixture-of-Experts (MoE) serving relies on wide expert parallelism (EP) to aggregate the memory capacity and bandwidth of many GPUs within one inference instance. This efficiency comes with a systems cost: every decoding step depends on token dispatch and combination across all active EP ranks, so even one rank failure can disrupt the entire service. Existing EP stacks handle such failures poorly because they treat membership as a fixed configuration established at initialization. The same rank set determines communicator state, expert placement, and the routing metadata baked into CUDA execution graphs, leaving the system with no way to shrink around a failure while keeping the instance valid. This paper argues that partial-failure tolerance should instead be formulated as a live EP validity problem. We present EEP, a communication and runtime substrate that represents membership as explicit, mutable runtime state. EEP repairs the specific state invalidated by a fault: it restores peer reachability without rebuilding the communication substrate, repairs lost expert coverage through a bandwidth-aware hierarchy, and reintegrates repaired ranks without forcing healthy ranks to recapture their CUDA graphs. We implement EEP in an EP serving stack integrated with SGLang and evaluate it under steady-state serving, failure recovery, and rank reintegration. The results show that explicit mutable membership preserves the steady-state fast path, staying within 4.4% of a fixed-membership DeepEP baseline under static serving, while turning a local rank fault from whole-instance downtime into two bounded interruptions. On a single-rank failure workload, EEP incurs an 11s recovery pause and an 8s reintegration pause, and restores throughput to within 95% of the pre-fault level within 52s, whereas a fixed-membership full-restart baseline remains unavailable until 348s.
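
To make the idea of membership as explicit, mutable runtime state more concrete, the following is a minimal toy sketch, not EEP's implementation or API. The class `ToyMembership` and its methods `fail`, `repair`, and `rejoin` are hypothetical names introduced only for illustration. The sketch assumes a simple rank-to-expert placement table and repairs lost expert coverage by assigning each uncovered expert to the least-loaded surviving rank; it does not model the bandwidth-aware hierarchy, communicator state, or CUDA graph handling described above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyMembership:
    """Hypothetical toy model of mutable EP membership (not EEP's API)."""

    placement: dict[int, set[int]]  # rank -> expert ids hosted on that rank
    active: set[int] = field(default_factory=set)

    def __post_init__(self) -> None:
        # Assumption: every rank in the placement table starts out active.
        if not self.active:
            self.active = set(self.placement)

    def covered(self) -> set[int]:
        """Experts that have at least one replica on an active rank."""
        return {e for r in self.active for e in self.placement[r]}

    def fail(self, rank: int) -> set[int]:
        """Drop a rank from the active set; return experts left uncovered."""
        before = self.covered()
        self.active.discard(rank)
        return before - self.covered()

    def repair(self, lost: set[int]) -> None:
        """Place each lost expert on the active rank hosting the fewest experts."""
        for expert in sorted(lost):
            target = min(sorted(self.active), key=lambda r: len(self.placement[r]))
            self.placement[target].add(expert)

    def rejoin(self, rank: int) -> None:
        """Reintegrate a repaired rank without touching other ranks' placement."""
        self.active.add(rank)


# Four ranks, eight experts; expert 0 has a second replica on rank 2.
m = ToyMembership({0: {0, 1}, 1: {2, 3}, 2: {4, 5, 0}, 3: {6, 7}})
lost = m.fail(1)   # experts 2 and 3 lose their only replica
m.repair(lost)     # coverage is restored on the surviving ranks
m.rejoin(1)        # the repaired rank becomes active again
```

In this toy model the instance stays valid after the failure because every expert is again reachable from an active rank; only the state invalidated by the fault (the active set and the coverage of the lost experts) is changed.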
