Serving large Mixture-of-Experts (MoE) models is challenging because of their large memory footprints, heterogeneous resource demands, and highly dynamic inference workloads. Most existing MoE inference systems deploy the entire model as a monolithic unit, forcing attention and MoE layers to share the same resource configuration despite their different scaling behaviors and resource bottlenecks. Such coarse-grained provisioning leads to resource inefficiency and suboptimal performance. We present JANUS, a scalable and resource-efficient MoE inference system built around three key principles. First, JANUS disaggregates attention and MoE layers onto separate GPU worker pools, enabling independent resource provisioning for the two layer types, and uses an adaptive two-phase communication mechanism for low-latency data exchange. Second, because MoE-layer execution is often memory-bound and highly sensitive to activated-expert imbalance, JANUS introduces a lightweight, microsecond-scale activation scheduler that balances per-layer activated experts across MoE instances to reduce inference latency. Third, JANUS employs a fine-grained, SLO-aware resource scaling scheme that jointly selects attention resources, MoE resources, and expert placement to minimize GPU cost under token-level SLOs. Evaluation shows that JANUS improves per-GPU throughput by up to 4.7x over state-of-the-art MoE inference baselines while satisfying token-level latency SLOs.
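To make the second principle concrete, the balancing goal of the activation scheduler can be pictured with a small sketch. This is not JANUS's actual algorithm, which the abstract does not specify. It is a hypothetical greedy heuristic that assumes some experts are replicated on more than one MoE instance, so the scheduler may choose which replica serves each activated expert. The function name `balance_activated_experts` and the `placement` layout are illustrative assumptions.

```python
from collections import defaultdict


def balance_activated_experts(activated, placement):
    """Greedy sketch (an assumption, not the paper's method).

    activated: expert ids activated in one MoE layer for the current batch.
    placement: dict mapping expert id -> list of instance ids holding a replica.
    Returns (assignment: expert -> instance, load: instance -> assigned count).
    """
    load = defaultdict(int)
    assignment = {}
    # Place the least flexible experts first (fewest replicas), ties by id.
    for expert in sorted(set(activated), key=lambda e: (len(placement[e]), e)):
        # Choose the hosting instance with the fewest activated experts so far.
        target = min(placement[expert], key=lambda inst: (load[inst], inst))
        assignment[expert] = target
        load[target] += 1
    return assignment, dict(load)


# Hypothetical layout: two MoE instances, five experts, three of them replicated.
placement = {0: [0], 1: [0, 1], 2: [0, 1], 3: [1], 4: [0, 1]}
activated = [0, 1, 2, 3, 4, 1]  # expert 1 is selected by two tokens
assignment, load = balance_activated_experts(activated, placement)
print(assignment, load)
```

Because MoE-layer execution is described as memory-bound, the sketch counts distinct activated experts per instance rather than tokens: each activated expert costs roughly one weight load regardless of how many tokens are routed to it. A production scheduler running at microsecond scale would likely need a far cheaper data path than Python dictionaries.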