Large Language Models (LLMs) are increasingly deployed as Internet/Web services (LLM-as-a-Service) that must meet strict latency Service-Level Objectives (SLOs) under tight GPU memory budgets. Mixture-of-Experts (MoE) models improve quality and throughput via sparse expert activation, but serving them efficiently is challenging because expert weights dominate the memory footprint and incur costly host--device transfers when offloaded. Moreover, MoE serving exhibits a phase disparity: the prefill phase tends to activate experts densely across many tokens, while the decode phase activates only a few experts per step. Applying one uniform expert loading/caching policy to both phases therefore leads to either a peak-memory blowup (prefill) or tail-latency inflation (decode). We present DuoServe-MoE, a QoS-oriented MoE serving system that decouples prefill and decode and applies phase-specialized expert scheduling. For prefill, DuoServe-MoE uses a two-stream CUDA pipeline to overlap expert prefetching with non-MoE computation, reducing expert residency time and peak GPU memory. For decode, it employs a lightweight layer-level predictor, trained offline from activation traces, to prefetch only the experts likely to be activated, without modifying the model. Experiments on representative MoE LLMs show that DuoServe-MoE improves time to first token (TTFT) by up to $5.34\times$ and end-to-end latency by up to $7.55\times$ over representative baselines, while maintaining low runtime GPU memory usage under resource-constrained deployment.
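To make the decode-phase idea concrete, the following is a minimal sketch of a layer-level predictor built offline from activation traces. The abstract does not describe the predictor's actual form, so everything here is an assumption made for illustration: a simple co-occurrence count table stands in for the trained predictor, and the names `train_layer_predictor`, `predict_experts`, and the trace layout (one set of expert ids per layer) are hypothetical.

```python
from collections import Counter, defaultdict


def train_layer_predictor(traces, num_layers):
    """Count, offline, how often expert j in layer l+1 follows expert i in layer l.

    ASSUMPTION: each trace is a list with one set of activated expert ids per
    layer. A count table is an illustrative stand-in, not the paper's predictor.
    """
    counts = [defaultdict(Counter) for _ in range(num_layers - 1)]
    for trace in traces:
        for layer in range(num_layers - 1):
            for prev_expert in trace[layer]:
                counts[layer][prev_expert].update(trace[layer + 1])
    return counts


def predict_experts(counts, layer, active_experts, k):
    """Return up to k experts of layer+1 to prefetch, given layer's active experts."""
    scores = Counter()
    for expert in active_experts:
        # .get avoids creating empty entries for experts never seen in training.
        scores.update(counts[layer].get(expert, {}))
    # Sort by descending score, breaking ties by expert id for determinism.
    ranked = sorted(scores, key=lambda e: (-scores[e], e))
    return ranked[:k]


# Hypothetical two-layer traces: each inner set lists the experts activated.
traces = [
    [{0, 1}, {2, 3}],
    [{0, 1}, {2, 3}],
    [{0, 2}, {1, 3}],
]
counts = train_layer_predictor(traces, num_layers=2)
print(predict_experts(counts, layer=0, active_experts={0, 1}, k=2))  # prints [3, 2]
```

In a serving loop, the returned ids would be the candidates to copy from host to GPU memory while the current layer is still computing; experts that are mispredicted would still have to be loaded on demand.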