Multi-agent large language model (LLM) systems are increasingly adopted for complex language processing tasks that require communication and coordination among agents. However, these systems often incur substantial overhead from repeatedly reprocessing overlapping contexts across agents. In typical pipelines, once an agent receives a message from its predecessor, the full context, including prior turns, must be reprocessed from scratch, which is highly inefficient. While key-value (KV) caching effectively avoids redundant computation in single-agent settings where prefixes remain unchanged, it cannot be directly reused in multi-agent scenarios because agent-specific context extensions cause prefixes to diverge. We identify the core challenge as the offset variance of KV-caches across agents. To address this, we propose KVCOMM, a training-free framework that enables efficient prefilling in multi-agent inference by reusing KV-caches and aligning the cache offsets of overlapping content under diverse prefix contexts. KVCOMM estimates and adjusts KV-caches for shared content by referencing a pool of cached examples, termed anchors, that store observed cache deviations under varying prefixes. The anchor pool is maintained and updated online, allowing dynamic adaptation to distinct user requests and context structures. KVCOMM achieves over a 70% cache-reuse rate across diverse multi-agent workloads, including retrieval-augmented generation, math reasoning, and collaborative coding tasks, without quality degradation. In particular, under a five-agent setting where each fully connected agent receives 1K input tokens with 512 prefix tokens and 512 output tokens, KVCOMM achieves up to a 7.8x speedup over the standard prefill pipeline, reducing time-to-first-token (TTFT) from ~430 ms to ~55 ms.
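To make the anchor mechanism concrete, the following is a minimal Python sketch of how an anchor pool might store observed cache deviations and reuse them to estimate a shared segment's KV-cache under a new prefix offset. All names here (AnchorPool, add, estimate) and the nearest-anchor deviation-transfer heuristic are illustrative assumptions for exposition, not the paper's actual API or algorithm.

```python
import torch

class AnchorPool:
    """Stores observed (prefix_len, KV deviation) pairs for one shared segment.

    Hypothetical sketch: assumes a segment's keys/values deviate from a
    reference prefill in a way that transfers across similar prefix lengths.
    """

    def __init__(self, max_anchors: int = 8):
        self.anchors = []  # list of (prefix_len, delta_k, delta_v)
        self.max_anchors = max_anchors

    def add(self, prefix_len, base_k, base_v, observed_k, observed_v):
        # Record how this segment's cache deviates from the reference prefill
        # when the same tokens appear under a different prefix.
        self.anchors.append((prefix_len, observed_k - base_k, observed_v - base_v))
        self.anchors = self.anchors[-self.max_anchors:]  # keep the newest anchors

    def estimate(self, prefix_len, base_k, base_v):
        # Approximate the segment's KV-cache at a new offset by reusing the
        # deviation of the anchor whose prefix length is closest, instead of
        # re-prefilling the segment from scratch.
        _, delta_k, delta_v = min(self.anchors, key=lambda a: abs(a[0] - prefix_len))
        return base_k + delta_k, base_v + delta_v


# Usage with illustrative shapes: base_k/base_v come from prefilling the
# shared message once under a reference prefix.
pool = AnchorPool()
base_k, base_v = torch.randn(16, 64), torch.randn(16, 64)  # [tokens, head_dim]
obs_k, obs_v = base_k + 0.1, base_v + 0.1  # cache observed under a 512-token prefix
pool.add(512, base_k, base_v, obs_k, obs_v)
est_k, est_v = pool.estimate(600, base_k, base_v)  # new offset, no re-prefill
```

A full system would presumably also decide per segment whether an estimated cache is accurate enough to reuse and fall back to standard prefill otherwise, which is consistent with the reported reuse rate being high but below 100%.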