Large Language Models (LLMs) are increasingly deployed in complex multi-agent applications that use external function calls. This workload creates severe performance challenges for the KV cache: space contention leads to the eviction of critical agents' caches, and time underutilization leaves the caches of agents stalled on long-running tool calls idle in GPU memory. We present Tokencake, a KV-cache-centric serving framework that co-optimizes scheduling and memory management with an agent-aware design. Tokencake's Space Scheduler uses dynamic memory partitioning to shield critical agents from contention, while its Time Scheduler employs a proactive offload and predictive upload mechanism to repurpose GPU memory during function call stalls. Our evaluation on representative multi-agent benchmarks shows that Tokencake reduces end-to-end latency by over 47.06% and improves effective GPU memory utilization by up to 16.9% compared with vLLM.
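To make the Time Scheduler's idea concrete, the following is a minimal Python sketch, not Tokencake's actual implementation: the class and parameter names (TimeScheduler, KVCache, upload_lead_time_s) are illustrative assumptions. It shows an agent's KV cache being proactively offloaded to host memory when a function call begins and uploaded back shortly before the call is predicted to return, so the transfer overlaps the tail of the stall.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical block-level KV-cache handle; real engines (e.g., vLLM) manage
# paged GPU blocks internally, so this simple container stands in for them.
@dataclass
class KVCache:
    agent_id: str
    gpu_blocks: List[bytes] = field(default_factory=list)   # resident on GPU
    cpu_blocks: List[bytes] = field(default_factory=list)   # offloaded to host

class TimeScheduler:
    """Sketch of proactive offload / predictive upload: when an agent stalls
    on a tool call, move its KV cache to host memory so other agents can use
    the freed GPU blocks, then restore it just before the predicted return."""

    def __init__(self, upload_lead_time_s: float = 0.5):
        self.upload_lead_time_s = upload_lead_time_s
        self.stalled: Dict[str, float] = {}  # agent_id -> predicted return time

    def on_function_call(self, cache: KVCache, predicted_duration_s: float) -> None:
        # Proactive offload: move GPU blocks to host and free the GPU copies.
        cache.cpu_blocks = cache.gpu_blocks
        cache.gpu_blocks = []
        self.stalled[cache.agent_id] = time.monotonic() + predicted_duration_s

    def tick(self, caches: Dict[str, KVCache]) -> None:
        # Predictive upload: restore the cache shortly before the predicted
        # return time, hiding the transfer behind the end of the stall.
        now = time.monotonic()
        for agent_id, eta in list(self.stalled.items()):
            if now >= eta - self.upload_lead_time_s:
                cache = caches[agent_id]
                cache.gpu_blocks = cache.cpu_blocks
                cache.cpu_blocks = []
                del self.stalled[agent_id]
```

In this sketch, the predicted call duration would come from a latency model of the external tool; the lead time trades the risk of a late upload against how long the cache re-occupies GPU memory before decoding resumes.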