Efficient inference with Large Language Models (LLMs) increasingly relies on Key-Value (KV) caches, which store the key and value vectors already computed at each layer. These caches avoid redundant computation during autoregressive token generation, reducing the attention cost of each decoding step from quadratic to linear in the sequence length. However, the growth of KV caches poses significant system-level challenges, particularly as models get larger, contexts get longer, and concurrent requests compete for limited memory. Although several frameworks for KV cache management have recently emerged, their comparative trade-offs in memory consumption and inference performance are not yet fully understood, especially under varying request sizes and model configurations. In this work, we conduct an empirical study of three state-of-the-art KV cache management frameworks: vLLM, InfiniGen, and H2O. These frameworks employ techniques such as tensor offloading, token eviction heuristics, and speculative scheduling to balance memory usage and performance. We evaluate them on latency, throughput, and memory usage across a range of key parameters, including request rate, model size, and sparsity level. Our results identify the conditions under which each framework performs best and indicate how to select and configure KV cache strategies under memory and performance constraints.
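To make the role of the KV cache concrete, the following is a minimal sketch of single-head attention decoding with and without a cache, written in pure Python. It is an illustration only and does not reflect how vLLM, InfiniGen, or H2O are implemented. The names `ToyAttention`, `decode_without_cache`, and `decode_with_cache` are hypothetical, and the random weights and inputs are placeholders.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


class ToyAttention:
    """Hypothetical single-head attention layer with random weights."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)

        def make():
            return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

        self.Wq, self.Wk, self.Wv = make(), make(), make()
        self.kv_projections = 0  # counts key/value projections performed

    def project_kv(self, x):
        self.kv_projections += 1
        return matvec(self.Wk, x), matvec(self.Wv, x)


def decode_without_cache(layer, tokens):
    """Recompute keys and values for the whole prefix at every step."""
    outputs = []
    for t in range(len(tokens)):
        projected = [layer.project_kv(x) for x in tokens[: t + 1]]
        keys = [k for k, _ in projected]
        values = [v for _, v in projected]
        outputs.append(attend(matvec(layer.Wq, tokens[t]), keys, values))
    return outputs


def decode_with_cache(layer, tokens):
    """Project each token once and append its key and value to the cache."""
    keys, values, outputs = [], [], []
    for x in tokens:
        k, v = layer.project_kv(x)
        keys.append(k)    # the cache grows by one entry per generated token
        values.append(v)
        outputs.append(attend(matvec(layer.Wq, x), keys, values))
    return outputs


if __name__ == "__main__":
    rng = random.Random(42)
    tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(16)]
    layer = ToyAttention(dim=4)

    slow = decode_without_cache(layer, tokens)
    print("projections without cache:", layer.kv_projections)

    layer.kv_projections = 0
    fast = decode_with_cache(layer, tokens)
    print("projections with cache:", layer.kv_projections)
```

For a sequence of n tokens, the uncached loop performs n(n+1)/2 key/value projections while the cached loop performs n, and both produce the same outputs. The cost of that saving is that the cache holds one key and one value vector per token, per layer, per request, which is the memory growth that KV cache management frameworks aim to control.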