KV cache accelerates LLM inference by avoiding redundant computation, at the expense of memory. To support larger KV caches, prior work extends GPU memory with CPU memory via CPU offloading, which swaps KV cache between GPU and CPU memory. However, because the cache is updated dynamically, such swapping incurs high CPU memory traffic. We make a key observation that model parameters remain constant during runtime, unlike the dynamically updated KV cache. Building on this, we introduce Oneiros, which avoids KV cache swapping by remapping, and thereby repurposing, the memory allocated to model parameters for the KV cache. This parameter remapping is especially beneficial in multi-tenant environments, where the memory used for the parameters of inactive models can be reclaimed more aggressively. Exploiting the high CPU-GPU bandwidth offered by modern hardware such as the NVIDIA Grace Hopper Superchip, we show that Oneiros significantly outperforms state-of-the-art solutions: compared to vLLM, it reduces tail time-between-token latency by 44.8%-82.5%, reduces tail time-to-first-token latency by 20.7%-99.3%, and improves throughput by 6.6%-86.7%. The source code of Oneiros is available at https://github.com/UT-SysML/Oneiros/.