KV cache restoration has become a dominant bottleneck in serving long-context LLM workloads, including multi-turn conversations, retrieval-augmented generation, and agentic pipelines. Existing approaches treat restoration as a per-request tradeoff between recomputation and I/O transfer: they either recompute KV states from scratch or load them from external storage (e.g., CPU memory or remote machines). These approaches fail to exploit parallelism across tokens, layers, and distributed deployments, and they ignore resource contention under batched serving. We present CacheFlow, a KV cache restoration framework that recasts cache restoration as a multi-dimensional parallel execution problem. CacheFlow introduces a unified 3D parallelism abstraction across tokens, layers, and GPUs, enabling fine-grained overlap of recomputation and I/O along the structural dependencies of transformer inference. At the core of CacheFlow is a batch-aware two-pointer scheduler that jointly optimizes compute and I/O allocation across requests by prioritizing the operations that yield the highest marginal reduction in recomputation cost. Our evaluations show that CacheFlow reduces Time-To-First-Token (TTFT) by 10%-62% relative to existing approaches across diverse models, workloads, and hardware.
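The scheduling idea can be illustrated with a minimal sketch. This is not CacheFlow's implementation; it is one plausible reading of a batch-aware two-pointer scheduler under stated assumptions. Each request's context is split into chunks. Recomputation sweeps forward from the first chunk, since later chunks depend on earlier KV states. I/O fills chunks backward from the last one. The cost model (`recompute_cost`), the function `plan_restoration`, and the fixed `io_budget` shared by the batch are all hypothetical. The sketch produces a static plan and does not model overlap in time.

```python
import heapq


def recompute_cost(chunk_index: int, base: float = 1.0, attn: float = 0.5) -> float:
    """Hypothetical cost model: fixed per-chunk work plus attention over the prefix.

    Later chunks attend to more tokens, so they are more expensive to recompute.
    """
    return base + attn * (chunk_index + 1)


def plan_restoration(chunk_counts: list[int], io_budget: int) -> list[tuple[int, int]]:
    """Split each request's chunks between recomputation and I/O.

    chunk_counts[r] is the number of chunks in request r; io_budget is the
    number of chunk loads the shared I/O path can perform for the whole batch.
    Returns one (recomputed_chunks, loaded_chunks) pair per request.
    """
    # hi[r] is the I/O pointer: chunks with index >= hi[r] are loaded from storage.
    hi = list(chunk_counts)
    # Max-heap (via negated cost) over each request's next candidate chunk for I/O.
    heap = [(-recompute_cost(n - 1), r) for r, n in enumerate(chunk_counts) if n > 0]
    heapq.heapify(heap)

    while io_budget > 0 and heap:
        # Load the chunk whose recomputation would be most expensive to perform.
        _, r = heapq.heappop(heap)
        hi[r] -= 1
        io_budget -= 1
        if hi[r] > 0:
            heapq.heappush(heap, (-recompute_cost(hi[r] - 1), r))

    # The compute pointer covers [0, hi[r]); the I/O pointer covers [hi[r], n).
    return [(h, n - h) for h, n in zip(hi, chunk_counts)]


if __name__ == "__main__":
    # Two requests with 4 and 2 chunks share a budget of 4 chunk loads.
    print(plan_restoration([4, 2], io_budget=4))
```

Under this cost model the longer request receives more of the I/O budget, because its tail chunks are the most expensive to recompute. A per-request policy that ignores the rest of the batch would not capture that allocation.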