The selective parameter activation provided by Mixture-of-Experts (MoE) models has made them a popular choice in modern foundational models. However, MoEs face a fundamental tension when employed for serving. Batching, which is critical for serving performance, forces the activation of all experts, thereby negating MoEs' benefits and exacerbating memory bandwidth bottlenecks. Existing work on efficient MoE inference is unable to resolve this tension even with extensive workload-specific tuning. We present LYNX, a system that enables efficient MoE inference in a workload-agnostic fashion. LYNX leverages a key property of MoE training: load-balancing losses introduce batch-level expert activation skew and redundancy. LYNX exploits this property by remapping low-affinity token-to-expert assignments within each batch, using a novel AffinityBinning technique that reduces the total number of experts invoked. Our evaluation of LYNX on four state-of-the-art model families across nine benchmarks shows that it achieves up to a 1.30x improvement in throughput while keeping accuracy loss below 1 percentage point across tasks. Further, LYNX is complementary to existing techniques and further boosts their performance by up to 1.38x.
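To make the idea of batch-level remapping concrete, here is a minimal sketch in Python. It is not the AffinityBinning algorithm, whose details the abstract does not give. It only illustrates the general pattern under stated assumptions: each token keeps its highest-scoring expert, and lower-ranked assignments are redirected to experts the batch already invokes. The names `top_k_experts`, `remap_low_affinity`, `active_experts`, and the `keep` budget are hypothetical, as are the toy router scores.

```python
from collections import Counter


def top_k_experts(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]


def active_experts(assignments):
    """Set of experts invoked by at least one token in the batch."""
    return {e for chosen in assignments for e in chosen}


def remap_low_affinity(router_scores, k, keep):
    """Illustrative remapping only; an assumption, not the LYNX method.

    router_scores: one list of expert scores per token in the batch.
    k: number of experts selected per token.
    keep: target number of experts the batch may invoke.
    """
    assignments = [top_k_experts(s, k) for s in router_scores]

    # Every token's top-1 expert is always retained.
    allowed = {chosen[0] for chosen in assignments}

    # Fill the remaining budget with the most requested experts.
    demand = Counter(e for chosen in assignments for e in chosen)
    for expert, _ in demand.most_common():
        if len(allowed) >= keep:
            break
        allowed.add(expert)

    remapped = []
    for scores, chosen in zip(router_scores, assignments):
        new = [chosen[0]]
        for expert in chosen[1:]:
            if expert not in allowed:
                # Redirect to the best retained expert this token does not use yet.
                candidates = [e for e in allowed if e not in new and e not in chosen]
                if candidates:
                    expert = max(candidates, key=lambda e: scores[e])
            new.append(expert)
        remapped.append(new)
    return assignments, remapped


# Toy batch: 3 tokens, 4 experts, top-2 routing (made-up scores).
scores = [
    [0.60, 0.30, 0.05, 0.05],
    [0.50, 0.10, 0.35, 0.05],
    [0.10, 0.55, 0.05, 0.30],
]
before, after = remap_low_affinity(scores, k=2, keep=2)
print(len(active_experts(before)), len(active_experts(after)))  # prints: 4 2
```

In this toy batch, plain top-2 routing touches all four experts, while the remapped assignments touch only two. Invoking fewer experts means loading fewer expert weights per batch, which is the kind of memory-bandwidth saving the abstract describes.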