Mixture-of-Experts (MoE) has recently emerged as the mainstream architecture for efficiently scaling large language models while maintaining near-constant computational cost. Expert parallelism distributes parameters by partitioning experts across devices, but this introduces token-level load imbalance during inference. Expert replication, a load-balancing technique widely adopted in serving frameworks, alleviates this imbalance in large-scale deployments by replicating heavily loaded experts. In this work, we demonstrate that existing replication schemes often over-replicate, with many replicas providing only marginal improvement. Replicas also consume substantial GPU memory, which may lead to resource contention and throughput degradation. We present CRAFT, an efficient expert replication framework that maximizes load balance under a given memory budget by performing fine-grained, per-layer replication based on the estimated replication benefit. CRAFT can be seamlessly integrated into existing serving frameworks without any additional training or model changes. Our evaluation shows that CRAFT increases end-to-end serving throughput by $1.14\times$ on average (up to $1.2\times$) over existing replication techniques in large-scale deployments with models ranging from hundreds of billions to a trillion parameters.
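To make the idea of benefit-driven, per-layer replication under a memory budget concrete, the following is a minimal illustrative sketch, not CRAFT's actual algorithm. It assumes that an expert's tokens split evenly across its replicas, that the budget is counted in whole replicas, and that the "estimated replication benefit" is the drop in the hottest per-replica load within a layer. The function names (`max_share`, `next_gain`, `allocate_replicas`) and the `min_gain` threshold are hypothetical, and the greedy selection is a heuristic that does not guarantee an optimal allocation.

```python
import heapq


def max_share(loads, replicas):
    """Largest per-replica load, assuming tokens split evenly across an expert's copies."""
    return max(load / count for load, count in zip(loads, replicas))


def next_gain(loads, replicas):
    """Estimated benefit of one more replica in a layer (an assumed proxy metric):
    the reduction in the hottest per-replica load if the hottest expert gets a copy."""
    hot = max(range(len(loads)), key=lambda e: loads[e] / replicas[e])
    trial = replicas.copy()
    trial[hot] += 1
    return max_share(loads, replicas) - max_share(loads, trial), hot


def allocate_replicas(layer_loads, budget, min_gain=0.0):
    """Greedily spend a replica budget across layers, always taking the replica
    with the highest estimated gain, and stop once no replica beats min_gain."""
    replicas = [[1] * len(loads) for loads in layer_loads]
    heap = []
    for layer, loads in enumerate(layer_loads):
        gain, hot = next_gain(loads, replicas[layer])
        heapq.heappush(heap, (-gain, layer, hot))  # negate: heapq is a min-heap

    while budget > 0 and heap:
        neg_gain, layer, hot = heapq.heappop(heap)
        if -neg_gain <= min_gain:
            break  # remaining replicas would be marginal; leave the memory unused
        replicas[layer][hot] += 1
        budget -= 1
        gain, hot = next_gain(layer_loads[layer], replicas[layer])
        heapq.heappush(heap, (-gain, layer, hot))
    return replicas


if __name__ == "__main__":
    # Layer 0 is heavily skewed toward expert 0; layer 1 is already balanced.
    layer_loads = [[100, 10, 10, 10], [30, 30, 30, 30]]
    print(allocate_replicas(layer_loads, budget=2))
```

In this toy input, both replicas go to expert 0 of the skewed layer, while the balanced layer receives none. With a larger budget the loop stops early once an extra copy no longer lowers the hottest per-replica load, which mirrors the abstract's point that additional replicas can cost memory without improving balance.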