Mixture-of-Experts models have become a dominant architecture for scaling Large Language Models by activating only a sparse subset of experts per token. However, latency-critical MoE inference faces a fundamental tension: while expert parallelism improves memory efficiency, it also amplifies execution stragglers. In real-world serving, continuous batching and diverse concurrent requests induce rapid semantic shifts, causing expert hotspots to migrate abruptly across GPUs and triggering the "double penalty" of coupled computational skew and network congestion. We propose PROBE, an inference system that co-balances computation and communication in real time. PROBE introduces Continuous Lookahead Pipelining, which proactively predicts, plans, and prefetches for upcoming layers while keeping all control overheads off the critical path. PROBE consists of: (1) a Gate-Initialized Lookahead Predictor that distills the target router to forecast next-layer expert activation with high fidelity; (2) a Hardware-Aware Balance Planning solver that jointly optimizes dynamic expert replication and token assignment under strict hiding-window constraints; and (3) a Phase-Locked Co-Scheduling policy that uses split-phase transmission to hide bandwidth-intensive expert transfers behind computation without contending with All-to-All collectives. Experiments show that PROBE reduces prefill latency by up to 1.32× and improves decoding throughput by up to 1.26× over state-of-the-art baselines, especially under extreme workload volatility.