Retrieval-augmented and agentic workloads repeatedly prefill recurring, predictable, structured inputs, which we call "spans", such as documents and code files. Yet prefix caching in engines such as vLLM cannot reuse a span's KV entries unless the request shares an identical prefix with another request, while Position-Independent Caching (PIC) implementations in production-grade inference servers typically either require substantial server code changes or keep KV state outside the server, incurring host-to-device transfer overhead. We present Minimalistic PIC (MiniPIC), a minimal, flexible, and fast vLLM design built from two ingredients: a positional-encoding-free KV cache and user-controlled cache-reuse primitives. MiniPIC stores unrotated K vectors in the KV cache, applies RoPE to K tiles inside attention using per-request logical positions, and exposes three user-facing, token-level primitives that modify hashing behavior and the effective block-level causal attention structure: block-aligned padding, span separator (SSep), and prompt depend (PDep). With fewer than 100 lines of core-engine changes plus a custom attention backend, these primitives are sufficient to realize multiple PIC methods, including Block-Attention, EPIC, and Prompt Cache, within the same running vLLM instance, while integrating natively with KV cache CPU offload implementations. On 2WikiMultihopQA, MiniPIC with interleaved scheduling improves prefill throughput by 49% over baseline vLLM, reduces time-to-first-token for cached spans by up to two orders of magnitude, preserves the linear prefill scaling of uncached spans, and incurs only 5.7% overhead in the worst case.
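To illustrate why a positional-encoding-free KV cache makes spans reusable at any position, the following is a minimal pure-Python sketch, not MiniPIC's implementation. It assumes a toy cache keyed by a span id (`kv_cache`) and hypothetical helpers `rope` and `scores`; the real system applies RoPE to K tiles inside a custom vLLM attention backend. The sketch stores K unrotated and rotates it only when attention is computed, using the logical position the span occupies in the current request.

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply rotary position embedding: rotate each consecutive pair by pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical cache: UNROTATED K vectors of one span, keyed by a span id.
kv_cache = {"doc_a": [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.0, 0.7]]}

def scores(query, q_pos, span_id, span_start):
    """Attention logits of one query against a cached span placed at span_start.

    K is rotated here, at attention time, with the span's logical positions
    in the current request; the cached entries are never modified.
    """
    q = rope(query, q_pos)
    keys = kv_cache[span_id]
    return [dot(q, rope(k, span_start + j)) for j, k in enumerate(keys)]

query = [1.0, 0.0, -1.0, 0.5]

# Same cached span, placed at two different offsets in two requests.
at_start = scores(query, q_pos=10, span_id="doc_a", span_start=0)
shifted = scores(query, q_pos=110, span_id="doc_a", span_start=100)

# A placement with a different query-to-span distance, for contrast.
closer = scores(query, q_pos=10, span_id="doc_a", span_start=5)
```

Because RoPE logits depend only on the relative offset between query and key, `at_start` and `shifted` agree up to floating-point error even though the span sits at different absolute positions, while `closer` differs. Had K been cached already rotated for position 0, the second request could not reuse it without undoing or recomputing the rotation.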