The rise of million-token, agent-based applications has placed unprecedented demands on large language model (LLM) inference services. Because these tasks run for a long time, they are more exposed to hardware and software faults, which lead to costly job failures, wasted resources, and degraded user experience. The stateful key-value (KV) cache, which grows with sequence length, is a central challenge: it is both a critical and a vulnerable component in distributed serving systems. In this work, we propose GhostServe, a novel checkpointing solution for fault-tolerant LLM serving. Specifically, GhostServe protects the streaming KV cache in the shadow by applying erasure coding to generate parity shards and storing them in host memory. In the event of device failures, GhostServe enables fast reconstruction of the lost KV cache, allowing inference to resume seamlessly without costly full recomputation or state replication. Evaluations demonstrate that, in the presence of system failures, GhostServe reduces checkpointing latency by up to 2.7x and recovery latency by 2.1x for a single batch, and reduces median response latency by 1.2x compared to existing methods, paving the way for high-availability and cost-effective LLM serving at scale.
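To make the parity-shard idea concrete, the following is a minimal sketch of erasure-coded protection for a sharded KV cache. The abstract does not say which erasure code GhostServe uses, so this sketch assumes the simplest one, a single XOR parity shard, which can repair exactly one lost shard. The function names (`encode_parity`, `reconstruct`), the four-device layout, and the byte-string stand-ins for KV tensors are all hypothetical and chosen only for illustration.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("shards must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(shards: List[bytes]) -> bytes:
    """Compute one parity shard as the XOR of all data shards."""
    return reduce(xor_bytes, shards)


def reconstruct(shards: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Rebuild at most one missing shard (marked None) from survivors and parity."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost shard")
    if not missing:
        return list(shards)
    survivors = [s for s in shards if s is not None]
    # XOR of the parity with every surviving shard yields the lost shard.
    repaired = reduce(xor_bytes, survivors, parity)
    out = list(shards)
    out[missing[0]] = repaired
    return out


# Hypothetical setup: the KV cache is split across 4 devices (assumption).
kv_shards = [bytes([d * 16 + i for i in range(8)]) for d in range(4)]
host_parity = encode_parity(kv_shards)  # would be kept in host memory

damaged: List[Optional[bytes]] = list(kv_shards)
damaged[2] = None  # simulate the failure of device 2
recovered = reconstruct(damaged, host_parity)
assert recovered == kv_shards
```

Recovery here costs one XOR pass over the surviving shards rather than a recomputation of the whole sequence, and the host stores one parity shard instead of a full replica. A production system that must tolerate several simultaneous failures would need a stronger code, such as Reed-Solomon, in place of single parity.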