Multi-turn conversation is a fundamental scenario in LLM applications, widely used in chatbots and AI agents. As the conversation evolves, historical tokens accumulate continuously. Existing systems cache their key-value (KV) pairs to avoid redundant computation. However, limited GPU memory (HBM) capacity often forces these KV caches to be offloaded to CPU memory or SSD, making KV cache reloads increasingly costly in terms of latency as the context grows. Meanwhile, the constrained HBM capacity also limits the maximum inference length, thereby restricting the number of turns that can be supported in a conversation. To address these two challenges, we propose SwiftCache, a collaborative inference system that enables heterogeneous models to share underutilized GPU memory and NVLink bandwidth within a server. Specifically, models with low KV cache demand donate idle GPU memory to store the prefix cache of high-demand models, allowing cross-model KV cache sharing over NVLink and avoiding slow PCIe transfers. SwiftCache further reduces memory pressure by keeping only the KV cache of the currently active layer in local GPU memory, thereby enabling longer-context inference. Our experiments on real-world workloads show that SwiftCache reduces P99 time-to-first-token (TTFT) by up to 69% and extends maximum context length by up to 3.98x compared to vLLM and SGLang, with minimal interference to co-located models.
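To make the two bottlenecks above concrete, the following back-of-envelope sketch estimates the size of a KV cache, the time to reload it over a given link, and the local footprint when only one layer's KV cache is kept resident. It is not SwiftCache's implementation. The model shape and the bandwidth figures are illustrative assumptions rather than values from the paper, and the estimate ignores link latency, contention, and any prefetch buffers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical transformer shape; not taken from the paper."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_elem: int = 2  # assumes fp16/bf16 KV entries


def kv_bytes_per_token(m: ModelShape) -> int:
    # One key vector and one value vector per KV head, per layer.
    return 2 * m.num_layers * m.num_kv_heads * m.head_dim * m.bytes_per_elem


def kv_bytes(m: ModelShape, num_tokens: int) -> int:
    return kv_bytes_per_token(m) * num_tokens


def reload_seconds(num_bytes: int, bandwidth_gb_per_s: float) -> float:
    # Pure bandwidth estimate; real transfers also pay latency and contention.
    return num_bytes / (bandwidth_gb_per_s * 1e9)


def resident_bytes_layerwise(m: ModelShape, num_tokens: int,
                             layers_resident: int = 1) -> int:
    # Local footprint if only `layers_resident` layers' KV cache stay on the GPU.
    return kv_bytes(m, num_tokens) * layers_resident // m.num_layers


if __name__ == "__main__":
    shape = ModelShape(num_layers=32, num_kv_heads=8, head_dim=128)
    total = kv_bytes(shape, num_tokens=32_768)
    # Assumed nominal bandwidths (GB/s); actual figures depend on the hardware.
    pcie_gb_s, nvlink_gb_s = 32.0, 300.0
    print("KV cache bytes:", total)
    print("reload over PCIe (s):", reload_seconds(total, pcie_gb_s))
    print("reload over NVLink (s):", reload_seconds(total, nvlink_gb_s))
    print("resident bytes, one layer:", resident_bytes_layerwise(shape, 32_768))
```

Under these assumptions, the reload time scales linearly with context length and inversely with link bandwidth, while layer-wise residency divides the local footprint by the number of layers.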