Efficiently serving large language models (LLMs) under dynamic and bursty workloads remains a key challenge for real-world deployment. Existing serving frameworks and static model compression techniques fail to adapt to workload fluctuations, leading either to service-level objective (SLO) violations under full-precision serving or to persistent accuracy degradation with static quantization. We present MorphServe, a dynamic, workload-aware LLM serving framework based on morphological adaptation. MorphServe introduces two asynchronous, token-level runtime mechanisms: quantized layer swapping, which selectively replaces less impactful layers with quantized alternatives during high-load periods, and pressure-aware KV cache resizing, which dynamically adjusts KV cache capacity in response to memory pressure. These mechanisms enable state-preserving transitions with minimal runtime overhead and are fully compatible with modern scheduling and attention techniques. Extensive experiments on Vicuna and Llama family models with real-world workloads demonstrate that MorphServe reduces average SLO violations by 92.45% and improves P95 time-to-first-token (TTFT) latency by 2.2x-3.9x compared to full-precision serving, without compromising generation quality. These results establish MorphServe as a practical and elastic solution for LLM deployment in dynamic environments.
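To make the two runtime mechanisms concrete, the following is a minimal sketch of a control loop that combines quantized layer swapping with pressure-aware KV cache resizing. Every name here (ServingState, MorphController, swap_layer, layer_sensitivity, resize, and all thresholds) is hypothetical and invented for illustration; none of it is MorphServe's actual API, and the real system operates asynchronously at token granularity rather than in a synchronous loop.

```python
# Illustrative sketch only: all classes, methods, and thresholds are
# hypothetical stand-ins for the mechanisms described in the abstract.

from dataclasses import dataclass


@dataclass
class ServingState:
    load: float          # normalized request arrival rate (assumed signal)
    mem_pressure: float  # fraction of GPU memory in use (assumed signal)


class MorphController:
    """Hypothetical controller combining the two MorphServe mechanisms:
    quantized layer swapping and pressure-aware KV cache resizing."""

    def __init__(self, model, kv_cache, load_high=0.8, load_low=0.5,
                 mem_high=0.9, mem_low=0.6):
        self.model = model        # assumed to expose swap_layer(idx, quantized=...)
        self.kv_cache = kv_cache  # assumed to expose resize(num_blocks)
        self.load_high, self.load_low = load_high, load_low
        self.mem_high, self.mem_low = mem_high, mem_low
        # Rank layers by an (assumed) sensitivity score so the least
        # impactful layers are swapped to quantized form first.
        self.swap_order = sorted(range(model.num_layers),
                                 key=model.layer_sensitivity)
        self.num_swapped = 0

    def step(self, state: ServingState) -> None:
        # Mechanism 1: quantized layer swapping under high load.
        if state.load > self.load_high and self.num_swapped < len(self.swap_order):
            idx = self.swap_order[self.num_swapped]
            self.model.swap_layer(idx, quantized=True)   # state-preserving swap-in
            self.num_swapped += 1
        elif state.load < self.load_low and self.num_swapped > 0:
            self.num_swapped -= 1
            idx = self.swap_order[self.num_swapped]
            self.model.swap_layer(idx, quantized=False)  # restore full precision

        # Mechanism 2: pressure-aware KV cache resizing. Shrink capacity
        # under memory pressure; grow it back when pressure subsides.
        if state.mem_pressure > self.mem_high:
            self.kv_cache.resize(int(self.kv_cache.num_blocks * 0.9))
        elif state.mem_pressure < self.mem_low:
            self.kv_cache.resize(int(self.kv_cache.num_blocks * 1.1))
```

The gradual, sensitivity-ordered swap order reflects the abstract's claim that only less impactful layers are replaced during high-load periods, which is what allows accuracy to degrade gracefully rather than uniformly as with static quantization.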