PROBE: Co-Balancing Computation and Communication in MoE Inference via Real-Time Predictive Prefetching

Mixture-of-Experts (MoE) models have become a dominant architecture for scaling large language models by activating only a sparse subset of experts per token. However, latency-critical MoE inference faces a fundamental tension: expert parallelism improves memory efficiency, but it also amplifies execution stragglers. In real-world serving, continuous batching and diverse concurrent requests induce rapid semantic shifts, so expert hotspots migrate abruptly across GPUs and trigger a "double penalty" of coupled computational skew and network congestion. We propose PROBE, an inference system that co-balances computation and communication in real time. PROBE introduces Continuous Lookahead Pipelining, which proactively predicts, plans, and prefetches for upcoming layers while keeping all control overhead off the critical path. PROBE consists of: (1) a Gate-Initialized Lookahead Predictor that distills the target router to forecast next-layer expert activation with high fidelity; (2) a Hardware-Aware Balance Planning solver that jointly optimizes dynamic expert replication and token assignment under strict hiding-window constraints; and (3) a Phase-Locked Co-Scheduling policy that uses split-phase transmission to hide bandwidth-intensive expert transfers behind computation without contending with All-to-All collectives. Experiments show that PROBE reduces prefill latency by up to 1.32× and improves decoding throughput by up to 1.26× over state-of-the-art baselines, especially under extreme workload volatility.
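
To make the idea of jointly choosing expert replicas and token assignments more concrete, the following is a minimal illustrative sketch, not PROBE's actual solver. It assumes a simple greedy heuristic: given predicted per-expert token counts for the next layer, it repeatedly copies the heaviest expert on the most loaded GPU to the least loaded GPU and shifts part of its tokens there. The function name `plan_replication`, its parameters, and the use of `max_replicas` as a stand-in for the number of expert transfers that fit in the hiding window are all assumptions made for illustration.

```python
from typing import Dict


def plan_replication(predicted_tokens: Dict[int, int],
                     home_gpu: Dict[int, int],
                     num_gpus: int,
                     max_replicas: int) -> Dict[int, Dict[int, int]]:
    """Greedy sketch: returns assignment[expert][gpu] = tokens sent to that copy.

    max_replicas is a hypothetical budget standing in for how many expert
    transfers can be hidden behind computation.
    """
    # Start with every expert served only by its home GPU.
    assignment = {e: {home_gpu[e]: n} for e, n in predicted_tokens.items()}
    load = [0] * num_gpus
    for e, n in predicted_tokens.items():
        load[home_gpu[e]] += n

    for _ in range(max_replicas):
        hot = max(range(num_gpus), key=lambda g: load[g])
        cold = min(range(num_gpus), key=lambda g: load[g])
        gap = load[hot] - load[cold]
        if gap <= 1:
            break  # already balanced to within one token
        # Pick the expert contributing the most tokens on the hot GPU.
        candidates = [(a[hot], e) for e, a in assignment.items()
                      if a.get(hot, 0) > 0]
        tokens, expert = max(candidates)
        # Move at most half the gap so the two GPUs meet in the middle.
        moved = min(tokens, gap // 2)
        assignment[expert][hot] -= moved
        assignment[expert][cold] = assignment[expert].get(cold, 0) + moved
        load[hot] -= moved
        load[cold] += moved
    return assignment


# Example: expert 0 is a hotspot on GPU 0.
predicted = {0: 100, 1: 10, 2: 10, 3: 10}
home = {0: 0, 1: 0, 2: 1, 3: 1}
plan = plan_replication(predicted, home, num_gpus=2, max_replicas=2)
print(plan[0])  # {0: 55, 1: 45}
```

In this example GPU 0 initially carries 110 predicted tokens and GPU 1 carries 20; replicating expert 0 onto GPU 1 and shifting 45 tokens brings both to 65. A real planner would also have to account for transfer cost, memory limits, and the All-to-All traffic the new assignment creates, which this sketch ignores.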
