We present PolyKV, a system in which multiple concurrent inference agents share a single, asymmetrically compressed KV cache pool. Rather than allocating a separate KV cache per agent (the standard paradigm), PolyKV writes a compressed cache once and injects it into N independent agent contexts via HuggingFace DynamicCache objects. Compression is asymmetric: keys are quantized to int8 (q8_0) to preserve softmax stability, while values are compressed with TurboQuant MSE, a Fast Walsh-Hadamard Transform (FWHT) rotation followed by 3-bit Lloyd-Max quantization with centroids tuned to N(0,1). We evaluate across two model scales (SmolLM2-1.7B-Instruct and Llama-3-8B-Instruct), three context lengths (600 to 7,194 tokens), and up to 15 concurrent agents. PolyKV achieves a stable 2.91x compression ratio across all configurations. On Llama-3-8B with 15 agents sharing a 4K-token context, PolyKV reduces KV cache memory from 19.8 GB to 0.45 GB, a 97.7% reduction, while incurring only a +0.57% perplexity (PPL) increase and maintaining a mean BERTScore F1 of 0.928. The PPL delta does not grow with agent count and improves as context length increases, inverting to -0.26% at 1,851 coherent tokens. To our knowledge, no prior work combines a single shared, lossy-compressed KV pool with multi-reader concurrent agent access.
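The asymmetric scheme described above can be illustrated with a minimal pure-Python sketch. It is not the PolyKV or TurboQuant implementation. All function names are hypothetical, the per-vector RMS scale and the symmetric per-vector int8 key quantizer are simplifying assumptions (the q8_0 block layout is not reproduced), and no random sign flips are applied before the FWHT. The last helper shows that 8-bit keys and 3-bit values against a 16-bit baseline give 32/11, or about 2.91, which agrees with the reported ratio if scale metadata is ignored; that reading is an inference from the numbers, not something the abstract states.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two.

    With the 1/sqrt(n) scaling the transform is its own inverse.
    """
    x = list(vec)
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                x[j], x[j + h] = x[j] + x[j + h], x[j] - x[j + h]
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]


def lloyd_max_gaussian(levels=8, iters=500):
    """Lloyd-Max centroids for N(0,1); 8 levels correspond to 3 bits."""
    def pdf(t):
        return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

    def cdf(t):
        return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

    c = [-2.0 + 4.0 * i / (levels - 1) for i in range(levels)]
    for _ in range(iters):
        # Decision boundaries are midpoints between neighbouring centroids.
        edges = [-math.inf] + [(a + b) / 2 for a, b in zip(c, c[1:])] + [math.inf]
        # Each centroid moves to the conditional mean of N(0,1) on its cell.
        c = [(pdf(lo) - pdf(hi)) / (cdf(hi) - cdf(lo))
             for lo, hi in zip(edges, edges[1:])]
    return c


def quantize_values(vec, centroids):
    """Rotate with the FWHT, normalise by RMS (assumption), pick nearest centroid."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec)) or 1.0
    codes = [min(range(len(centroids)), key=lambda k: abs(r / rms - centroids[k]))
             for r in fwht(vec)]
    return codes, rms


def dequantize_values(codes, rms, centroids):
    """Look up centroids, rescale, and undo the rotation."""
    return fwht([centroids[k] * rms for k in codes])


def quantize_keys_int8(vec):
    """Symmetric per-vector int8 quantization (a simplification of q8_0)."""
    scale = max(abs(v) for v in vec) / 127.0 or 1.0
    return [max(-127, min(127, round(v / scale))) for v in vec], scale


def compression_ratio(key_bits=8, value_bits=3, baseline_bits=16):
    """Bits per key/value pair before compression divided by bits after."""
    return 2 * baseline_bits / (key_bits + value_bits)
```

As a usage note, dividing the uncompressed footprint of one agent (19.8 GB / 15 = 1.32 GB) by `compression_ratio()` gives roughly 0.45 GB for the single shared pool, consistent with the figure quoted for the 15-agent Llama-3-8B configuration.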