MEPIC: Memory Efficient Position Independent Caching for LLM Serving

Modern LLM applications, such as deep-research assistants, coding agents, and Retrieval-Augmented Generation (RAG) systems, repeatedly process long prompt histories containing shared document or code chunks. This puts significant pressure on the key-value (KV) cache, which must operate within limited memory while sustaining high throughput and low latency. Prefix caching alleviates some of these costs by reusing the KV cache for previously processed tokens, but it is limited by strict prefix matching. Position-independent caching (PIC) enables chunk-level reuse at arbitrary positions, but requires selective recomputation and positional-encoding (PE) adjustments. Because these operations vary across queries, the KV for the same chunk diverges across requests. Moreover, without page alignment, chunk KV layouts diverge in memory, which prevents page sharing. As a result, savings in high-bandwidth memory (HBM) are only modest even when many requests reuse the same content. We present MEPIC, a memory-efficient PIC system that enables chunk KV reuse across positions, requests, and batches. MEPIC aligns chunk KV to paged storage, shifts recomputation from the token level to the block level so that only the first block is request-specific, removes positional encodings from the stored KV via Rotary Position Embedding (RoPE) fusion in the attention kernel, and makes the remaining blocks fully shareable. These techniques eliminate most duplicate chunk KV in HBM, reducing HBM usage by up to 2x over state-of-the-art PIC at comparable latency and accuracy, and by up to 5x for long prompts, without any model changes.
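The idea of keeping positional encodings out of the stored KV can be illustrated with a small toy, offered here only as a sketch under stated assumptions and not as MEPIC's actual kernel. It assumes the interleaved-pair RoPE layout, a single attention head, and hypothetical helper names (`rope`, `scores_fused`, `scores_baseline`). It omits values, softmax, the 1/sqrt(d) scaling, paging, and the block-level recomputation of the first block. The point it shows: if keys are cached without RoPE and rotated only when they are read, one cached copy of a chunk can serve requests that place the chunk at different offsets.

```python
import math
import random


def rope(vec, pos, base=10000.0):
    """Apply RoPE to `vec` at position `pos` (interleaved-pair layout, an assumption).

    Pair j = i // 2 is rotated by the angle pos * base ** (-2j / d).
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def scores_fused(query, q_pos, cached_keys, chunk_start):
    """Attention logits with RoPE applied to position-free cached keys at read time.

    `cached_keys` is never modified, so the same list can back many requests.
    """
    q = rope(query, q_pos)
    return [dot(q, rope(k, chunk_start + j)) for j, k in enumerate(cached_keys)]


def scores_baseline(query, q_pos, raw_keys, chunk_start):
    """Reference: keys are rotated for their final positions before being stored,
    so each distinct chunk offset needs its own stored copy."""
    stored = [rope(k, chunk_start + j) for j, k in enumerate(raw_keys)]
    q = rope(query, q_pos)
    return [dot(q, k) for k in stored]


# Toy data: one shared chunk of 4 tokens with head dimension 8 (hypothetical sizes).
rng = random.Random(0)
HEAD_DIM = 8
shared_chunk_keys = [[rng.uniform(-1, 1) for _ in range(HEAD_DIM)] for _ in range(4)]
query = [rng.uniform(-1, 1) for _ in range(HEAD_DIM)]

# Request A places the chunk at offset 0, request B at offset 37.
for chunk_start in (0, 37):
    q_pos = chunk_start + len(shared_chunk_keys)
    fused = scores_fused(query, q_pos, shared_chunk_keys, chunk_start)
    reference = scores_baseline(query, q_pos, shared_chunk_keys, chunk_start)
    assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(fused, reference))
```

In this toy the two requests read the same `shared_chunk_keys` list, whereas the baseline materializes a separate rotated copy per offset. A real system would perform the rotation inside the attention kernel over paged blocks rather than in Python lists.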
