Sparse Mixture-of-Experts (SMoE) architectures are increasingly used to scale large language models efficiently, delivering strong accuracy under fixed compute budgets. However, SMoE models often suffer from severe load imbalance across experts, where a small subset of experts receives most tokens while others are underutilized. Prior work has focused mainly on training-time solutions such as routing regularization or auxiliary losses, leaving inference-time behavior, which is critical for deployment, less explored. We present a systematic analysis of expert routing during inference and identify three findings: (i) load imbalance persists and worsens with larger batch sizes, (ii) selection frequency does not reliably reflect expert importance, and (iii) overall expert workload and importance can be estimated using a small calibration set. These insights motivate inference-time mechanisms that rebalance workloads without retraining or router modification. We propose Replicate-and-Quantize (R&Q), a training-free and near-lossless framework for dynamic workload rebalancing. In each layer, heavy-hitter experts are replicated to increase parallel capacity, while less critical experts and replicas are quantized to remain within the original memory budget. We also introduce a Load-Imbalance Score (LIS) to measure routing skew by comparing heavy-hitter load to an equal allocation baseline. Experiments across representative SMoE models and benchmarks show up to 1.4× reduction in imbalance with accuracy maintained within ±0.6%, enabling more predictable and efficient inference.
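The Load-Imbalance Score described above can be sketched as a simple ratio. This is a minimal illustration, not the paper's exact formula: it assumes LIS is computed as the load of the top-k heavy-hitter experts relative to the equal-allocation baseline (total tokens divided by the number of experts), following the verbal description only; the function name, the `k` parameter, and the top-k choice are hypothetical.

```python
import numpy as np

def load_imbalance_score(token_counts, k=1):
    """Hypothetical LIS sketch: mean load of the top-k heavy-hitter
    experts divided by the equal-allocation baseline. A value of 1.0
    means perfectly balanced routing; larger values mean more skew."""
    counts = np.asarray(token_counts, dtype=float)
    baseline = counts.sum() / counts.size          # equal allocation per expert
    heavy = np.sort(counts)[::-1][:k].mean()       # mean load of top-k experts
    return heavy / baseline

# Balanced routing: every expert gets the same number of tokens -> 1.0
print(load_imbalance_score([25, 25, 25, 25]))
# Skewed routing: one expert receives 70% of tokens -> 70 / 25 = 2.8
print(load_imbalance_score([70, 10, 10, 10]))
```

Under this reading, R&Q would lower the effective per-replica load of heavy-hitter experts (the numerator) by splitting their traffic across replicas, pushing LIS toward 1.0.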