Modern large language models (LLMs) drive interactive AI systems but are bottlenecked by the memory-intensive growth of key-value (KV) caches, which limits real-time throughput under concurrent loads. Existing KV-cache compression methods rely on rigid heuristics, disrupt tensor layouts, or require specialized compute, hindering scalability and deployment. We propose joint encoding of KV-cache blocks, which fuses similar blocks across requests and input chunks into shared representations while preserving the standard cache structure. This alleviates the KV-cache memory bottleneck and supports high-concurrency serving without specialized hardware. Theoretically, we analyze the rate-distortion tradeoff of fused cache blocks under a Poisson process model. Empirically, our method achieves up to 4.38$\times$ KV-cache compression with negligible accuracy loss across diverse LLMs and benchmarks, outperforming recent structured and adaptive compression baselines. In real LLM serving, joint encoding improves token throughput by $\sim$40\% on a single-machine vLLM benchmark. Code is available at https://github.com/sef1/kv_fast_fusion kv_joint_encoding.
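To make the core idea concrete, the following is a minimal, hypothetical sketch of similarity-based fusion of KV-cache blocks. It is not the paper's implementation: the flattened block representation, the cosine-similarity threshold, and the greedy matching strategy are illustrative assumptions; the sketch only shows how several logical blocks can be mapped onto one shared physical block while an ordinary block table preserves the standard cache layout.

```python
# Hypothetical sketch of KV-block fusion via cosine-similarity matching.
# NOT the paper's method: block representation, threshold, and greedy
# matching are illustrative assumptions.
import numpy as np


def fuse_kv_blocks(kv_blocks: np.ndarray, sim_threshold: float = 0.98):
    """Map near-duplicate KV-cache blocks onto shared physical blocks.

    kv_blocks: array of shape (num_logical_blocks, flattened_block_dim),
               one flattened KV block per logical cache slot.
    Returns (physical_blocks, block_table), where block_table[i] is the
    index of the physical block that logical block i now points to.
    """
    # Unit-normalize each block so a dot product equals cosine similarity.
    norms = np.linalg.norm(kv_blocks, axis=1, keepdims=True) + 1e-8
    normed = kv_blocks / norms

    physical_blocks = []  # representative (shared) blocks kept in memory
    reps = []             # their normalized copies, used for similarity tests
    block_table = np.empty(len(kv_blocks), dtype=np.int64)

    for i, blk in enumerate(normed):
        # Greedily reuse the first existing representative that is similar enough.
        fused = False
        for j, rep in enumerate(reps):
            if float(blk @ rep) >= sim_threshold:
                block_table[i] = j
                fused = True
                break
        if not fused:
            block_table[i] = len(physical_blocks)
            physical_blocks.append(kv_blocks[i])
            reps.append(blk)

    return np.stack(physical_blocks), block_table


# Toy example: two requests whose prompts share a chunk yield near-identical blocks.
rng = np.random.default_rng(0)
shared = rng.normal(size=(1, 1024))
blocks = np.concatenate([
    shared,
    shared + 1e-4 * rng.normal(size=(1, 1024)),  # near-duplicate of the shared chunk
    rng.normal(size=(2, 1024)),                  # two unrelated blocks
])
phys, table = fuse_kv_blocks(blocks)
print(f"{phys.shape[0]} physical blocks serve {len(table)} logical blocks")
```

In this toy run the two near-duplicate blocks collapse to one physical block, so three physical blocks serve four logical slots; attention kernels still index the cache through the block table, which is why the standard cache structure is preserved.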