Large Language Model (LLM) serving must meet stringent Service Level Objectives (SLOs) for both the prefill and decode phases. Some existing solutions disaggregate the two phases, causing potential resource idleness or compute redundancy. Others split the prefill phase into chunks and fuse the chunks with decode iterations, creating a dilemma between SLO compliance and high utilization. To address these issues, an efficient serving system should dynamically adapt compute allocation, decouple compute from memory management, and execute prefill and decode independently. We present MuxWise, an LLM serving framework that adopts a new paradigm, intra-GPU prefill-decode multiplexing, to meet these requirements. To fully exploit this paradigm, MuxWise integrates a bubble-free multiplexing engine, a contention-tolerant estimator, and an SLO-aware dispatcher. Evaluation shows that MuxWise improves peak throughput under SLO guarantees by an average of 2.20x (up to 3.06x) over state-of-the-art baselines.