Multi-agent LLM applications organize execution in synchronized rounds, in which a central scheduler gathers the outputs of all agents and redistributes the combined context. This All-Gather communication pattern creates massive KV Cache redundancy, because every agent's prompt contains the same shared output blocks, yet existing reuse methods fail to exploit it efficiently. We present TokenDance, a system that scales the number of concurrent agents by exploiting the All-Gather pattern for collective KV Cache sharing. TokenDance's KV Collector performs KV Cache reuse over the full round in one collective step, so the cost of reusing a shared block is paid once regardless of agent count. Its Diff-Aware Storage encodes sibling caches as block-sparse diffs against a single master copy, achieving 11-17x compression on representative workloads. Evaluation on GenerativeAgents and AgentSociety shows that TokenDance supports up to 2.7x more concurrent agents than vLLM with prefix caching under SLO requirements, reduces per-agent KV Cache storage by up to 17.5x, and achieves up to 1.9x prefill speedup over per-request position-independent caching.
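To make the idea of block-sparse diffs against a master copy concrete, the following is a minimal sketch and not TokenDance's actual storage format. It assumes that a cache can be modeled as a flat list of floats, that blocks are compared for exact equality, and that master and sibling have the same length; the names `split_blocks`, `encode_diff`, and `decode_diff` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Block = Tuple[float, ...]  # one KV block, modeled here as a tuple of floats


def split_blocks(cache: Sequence[float], block_size: int) -> List[Block]:
    """Cut a flat cache into consecutive fixed-size blocks."""
    return [tuple(cache[i:i + block_size]) for i in range(0, len(cache), block_size)]


def encode_diff(master: Sequence[float], sibling: Sequence[float],
                block_size: int) -> Dict[int, Block]:
    """Keep only the sibling blocks that differ from the master copy.

    Assumption: exact equality decides whether a block is shared.
    """
    if len(master) != len(sibling):
        raise ValueError("master and sibling must have the same length")
    master_blocks = split_blocks(master, block_size)
    sibling_blocks = split_blocks(sibling, block_size)
    return {
        index: sibling_block
        for index, (master_block, sibling_block)
        in enumerate(zip(master_blocks, sibling_blocks))
        if master_block != sibling_block
    }


def decode_diff(master: Sequence[float], diff: Dict[int, Block],
                block_size: int) -> List[float]:
    """Rebuild a sibling cache from the master copy plus its sparse diff."""
    rebuilt: List[float] = []
    for index, master_block in enumerate(split_blocks(master, block_size)):
        rebuilt.extend(diff.get(index, master_block))
    return rebuilt


# Toy usage: a sibling that differs from the master in a single block.
master = [float(i) for i in range(16)]
sibling = list(master)
sibling[5] = -1.0
diff = encode_diff(master, sibling, block_size=4)
restored = decode_diff(master, diff, block_size=4)
```

In this toy case only one of the four blocks is stored for the sibling, which is the source of the storage savings when most blocks are shared across agents. How the real system selects the master copy and decides that two blocks match is not specified in the abstract.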