Large Language Model (LLM) serving systems remain fundamentally fragile: frequent hardware faults in hyperscale clusters trigger disproportionate service outages in the software stack. Current recovery mechanisms are prohibitively slow, often requiring up to 10 minutes to reinitialize resources and reload massive model weights. We introduce KevlarFlow, a fault-tolerant serving architecture designed to bridge the gap between hardware unreliability and service availability. KevlarFlow combines (1) decoupled model-parallelism initialization, (2) dynamic traffic rerouting, and (3) background KV-cache replication to maintain high throughput during partial failures. Our evaluation shows that, compared to state-of-the-art LLM serving systems, KevlarFlow reduces mean time to recovery (MTTR) by 20x and, under failure conditions, improves average latency by 3.1x, 99th-percentile (p99) latency by 2.8x, average time-to-first-token (TTFT) by 378.9x, and p99 TTFT by 574.6x, all with negligible runtime overhead.