ViBE: Co-Optimizing Workload Skew and Hardware Variability for MoE Serving

In distributed Mixture-of-Experts (MoE) inference, input-dependent token routing interacts with GPU performance variability to create persistent stragglers under synchronized execution, where the slowest GPU determines layer latency. This performance variability is inherent to modern accelerators: manufacturing variation, power limits, and thermal conditions introduce measurable execution-time differences across nominally identical GPUs. The core challenge is that MoE execution-time imbalance arises from the interaction of workload skew and hardware asymmetry. Token routing produces uneven and layer-varying expert loads, while GPU throughput depends on device-specific operating characteristics and workload intensity. Prior work mitigates routing skew but assumes homogeneous hardware, optimizing token balance rather than execution latency. As a result, even balanced token assignments can leave hardware-induced stragglers unaddressed. We therefore propose Variability-Informed Binning of Experts (ViBE), a hardware-aware expert placement framework that minimizes execution-time imbalance across GPUs. ViBE combines per-GPU performance modeling with expert activation profiling to assign high-load experts to faster devices and low-load experts to slower ones, reducing layer-level stragglers without modifying model semantics or hardware. Because both workload characteristics and effective GPU throughput can shift across serving conditions, ViBE supports lightweight recalibration under workload or performance drift to refresh its routing and performance estimates when needed. Results show that ViBE consistently reduces execution-time imbalance and improves service-level objective (SLO) attainment by 14%, while lowering P90 time to first token (TTFT) by up to 45%. We further show that the impact of hardware variability increases at scale, making variability-aware placement important for efficient, high-utilization LLM serving.
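
To make the placement idea concrete, the following is a minimal illustrative sketch, not ViBE's actual algorithm. It assumes a simplified linear cost model in which a GPU's layer time is its assigned token count divided by a fixed relative speed, whereas the abstract notes that real throughput also depends on workload intensity. The names `place_experts`, `layer_latency`, and `slots_per_gpu`, the greedy heaviest-first heuristic, and all numbers are hypothetical and chosen only for illustration.

```python
from typing import List


def place_experts(expert_load: List[float],
                  gpu_speed: List[float],
                  slots_per_gpu: int) -> List[List[int]]:
    """Greedy sketch: place the heaviest expert first, each time on the GPU
    (with a free slot) whose projected finish time would be smallest.

    expert_load[e] is the profiled token count of expert e (assumed input).
    gpu_speed[g] is the relative throughput of GPU g (assumed input).
    """
    if len(expert_load) > len(gpu_speed) * slots_per_gpu:
        raise ValueError("not enough expert slots for all experts")

    assigned: List[List[int]] = [[] for _ in gpu_speed]
    tokens = [0.0] * len(gpu_speed)  # tokens currently routed to each GPU

    # Visit experts from highest to lowest profiled load.
    order = sorted(range(len(expert_load)),
                   key=lambda e: expert_load[e], reverse=True)
    for e in order:
        free = [g for g in range(len(gpu_speed))
                if len(assigned[g]) < slots_per_gpu]
        # Projected time under the assumed linear model: tokens / speed.
        best = min(free,
                   key=lambda g: (tokens[g] + expert_load[e]) / gpu_speed[g])
        assigned[best].append(e)
        tokens[best] += expert_load[e]
    return assigned


def layer_latency(assigned: List[List[int]],
                  expert_load: List[float],
                  gpu_speed: List[float]) -> float:
    """Under synchronized execution, the slowest GPU sets the layer latency."""
    return max(sum(expert_load[e] for e in experts) / gpu_speed[g]
               for g, experts in enumerate(assigned))


if __name__ == "__main__":
    loads = [900, 700, 400, 300, 200, 100, 100, 100]  # made-up token counts
    speeds = [1.0, 0.9, 1.1, 0.8]                      # made-up relative speeds

    # Hardware-aware placement uses the measured speeds.
    aware = place_experts(loads, speeds, slots_per_gpu=2)
    # Hardware-oblivious baseline balances tokens as if all GPUs were equal.
    oblivious = place_experts(loads, [1.0] * len(speeds), slots_per_gpu=2)

    print("aware latency:    ", layer_latency(aware, loads, speeds))
    print("oblivious latency:", layer_latency(oblivious, loads, speeds))
```

In this toy setup the token-balancing baseline places the heaviest expert on a mid-speed GPU, while the speed-aware variant moves it to the fastest one, which lowers the maximum per-GPU time. A realistic system would replace the linear `tokens / speed` estimate with a fitted per-GPU performance model and re-run placement when profiles drift.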
