Remote KV cache reuse fetches the KV cache for identical contexts from remote storage, avoiding recomputation and accelerating LLM inference. While it excels over high-speed networks, its performance degrades significantly in bandwidth-limited scenarios. Recent studies address this by transmitting KV caches in compressed form, but the associated heavyweight decompression counteracts the benefits of KV reuse. In this paper, we propose an efficient and widely deployable remote KV cache reuse solution that leverages GPU-native video codecs. Our system, KVFetcher, enables effective KV cache coding with two techniques. A codec-friendly tensor layout compresses the KV cache into a highly compact video format, enabling fast transmission. An efficient KV fetcher orchestrates the transmission, decoding, and restoration of compressed KV caches in a pipelined manner, eliminating resource contention, masking network fluctuations, and minimizing time-to-first-token (TTFT). We prototype KVFetcher on diverse GPUs, from high- to low-end. Experiments show that it reduces TTFT by up to 3.51× compared to state-of-the-art (SOTA) methods while maintaining lossless accuracy.
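To make the "codec-friendly tensor layout" idea concrete, here is a minimal sketch of mapping a KV cache into 2D grayscale frames that a GPU video encoder (e.g., NVENC) could consume. Everything below is an assumption for illustration: the toy cache shape, the simple 8-bit min-max quantizer, and the helpers `to_frames`/`from_frames` are hypothetical and not KVFetcher's actual layout or coding scheme (which the abstract reports as accuracy-lossless).

```python
import numpy as np

# Hypothetical shapes for illustration only.
layers, heads, seq_len, head_dim = 2, 4, 16, 64

# A toy fp16 KV cache: [layers, heads, seq_len, head_dim].
kv = np.random.default_rng(0).standard_normal(
    (layers, heads, seq_len, head_dim)).astype(np.float16)

def to_frames(kv_cache):
    """Quantize fp16 values to uint8 and tile each layer into one 2D
    'frame', so a video encoder can treat the sequence of layers as a
    grayscale video stream. (Hypothetical min-max quantizer.)"""
    lo, hi = float(kv_cache.min()), float(kv_cache.max())
    scale = 255.0 / (hi - lo + 1e-8)
    q = np.round((kv_cache.astype(np.float32) - lo) * scale).astype(np.uint8)
    l, h, s, d = q.shape
    # Tile heads side by side: frame shape (seq_len, heads * head_dim).
    frames = q.transpose(0, 2, 1, 3).reshape(l, s, h * d)
    return frames, lo, scale

def from_frames(frames, lo, scale, heads, head_dim):
    """Invert the layout and dequantize back to fp16 (lossy in this toy)."""
    l, s, _ = frames.shape
    q = frames.reshape(l, s, heads, head_dim).transpose(0, 2, 1, 3)
    return (q.astype(np.float32) / scale + lo).astype(np.float16)

frames, lo, scale = to_frames(kv)
restored = from_frames(frames, lo, scale, heads, head_dim)
print(frames.shape)  # (2, 16, 256): one 2D frame per layer
```

Laying the tensor out as frames lets the hardware codec exploit spatial redundancy across adjacent positions and heads; the restoration path mirrors the layout transform so decoded frames can be scattered straight back into the inference engine's KV buffers.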