Large language model (LLM) applications often reuse previously processed context, such as chat history and documents, which introduces significant redundant computation. Existing LLM serving systems address this redundancy by storing the KV caches of processed context and loading the corresponding KV cache when a new request reuses that context. As these LLM applications scale, however, the total size of the KV caches becomes excessively large and requires both DRAM and SSD for full storage. Prior work that stores KV caches in DRAM and SSD suffers from high loading delays, as most KV cache hits come from SSD, which is slow to load. To increase the KV cache hit rate on DRAM, we identify lossy KV cache compression as a promising approach. We design a lossy compression system that decides the compression algorithm, compression rate, and device placement for each KV cache entry to maximise DRAM hits and minimise loading delay without significantly degrading generation quality. Compared to various static compression baselines across three tasks, our system AdaptCache achieves 1.43--2.4x delay savings at the same quality and 6--55% quality improvements at the same delay.