Serving large language models (LLMs) is important for cloud providers, and caching intermediate results (KV\$) after processing each request substantially improves serving throughput and latency. However, there is limited understanding of how LLM serving benefits from KV\$ caching, and system design decisions such as cache eviction policies are highly workload-dependent. In this paper, we present the first systematic characterization of KV\$ workload patterns from one of the leading LLM service providers. We draw observations not covered by previous studies that focus on synthetic workloads, including: KV\$ reuse is skewed across requests, and reuse between single-turn requests is as important as reuse within multi-turn requests; reuse time and probability are diverse across all requests, but within a specific request category the pattern tends to be predictable; and the overall cache size required for an ideal cache hit ratio is moderate. Based on this characterization, we further propose a workload-aware cache eviction policy that improves serving performance under real-world traces, especially with limited cache capacity.