Multi-agent systems increasingly orchestrate multiple specialized language models to solve complex real-world problems, often invoking them over a shared context. This execution pattern processes the same prompt prefix repeatedly, once per model. Consequently, each model redundantly executes the prefill stage and maintains its own key-value (KV) cache, which increases aggregate prefill load and worsens tail latency by intensifying prefill-decode interference in existing LLM serving stacks. Disaggregated serving reduces such interference by placing prefill and decode on separate GPUs, but it does not eliminate the redundant computation and KV storage that arise when multiple models process the same prompt. To address this issue, we propose PrefillShare, a novel algorithm that shares the prefill stage across multiple models in a disaggregated setting. PrefillShare factorizes the model into prefill and decode modules, freezes the prefill module, and fine-tunes only the decode module. This design allows multiple task-specific models to share one prefill module and the KV cache generated for the same prompt. We further introduce a routing mechanism that enables effective prefill sharing across heterogeneous models in a vLLM-based disaggregated system. PrefillShare matches full fine-tuning accuracy on a broad range of tasks and models while delivering 4.5x lower p95 latency and 3.9x higher throughput in multi-model agent workloads.
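The sharing pattern the abstract describes can be illustrated with a minimal sketch. This is not the paper's implementation: the class names (`SharedPrefill`, `DecodeModule`) and the stand-in KV computation are hypothetical, and real attention K/V tensors and fine-tuning are elided. The sketch only shows the core idea that a frozen prefill module computes the KV cache for a prompt once, while several task-specific decode modules (the only fine-tuned parts) reuse it.

```python
# Illustrative sketch (assumed names, not the paper's code): one frozen
# prefill module serving a shared KV cache to multiple decode modules.

class SharedPrefill:
    """Frozen prefill module: computes the KV cache for a prompt once,
    then serves it to every decode module that asks for the same prompt."""
    def __init__(self):
        self.kv_cache = {}       # prompt -> simulated KV entries
        self.prefill_calls = 0   # counts actual prefill executions

    def get_kv(self, prompt):
        if prompt not in self.kv_cache:
            self.prefill_calls += 1
            # Stand-in for computing attention K/V over the prompt tokens.
            self.kv_cache[prompt] = [(tok, hash(tok) % 97)
                                     for tok in prompt.split()]
        return self.kv_cache[prompt]


class DecodeModule:
    """Task-specific decode module; only this part would be fine-tuned."""
    def __init__(self, name, prefill):
        self.name = name
        self.prefill = prefill

    def generate(self, prompt):
        kv = self.prefill.get_kv(prompt)  # reuse shared cache, no re-prefill
        return f"{self.name} decoded {len(kv)} cached prompt tokens"


prefill = SharedPrefill()
agents = [DecodeModule(f"agent{i}", prefill) for i in range(3)]
outputs = [a.generate("shared multi agent context") for a in agents]
print(prefill.prefill_calls)  # → 1: prefill ran once despite three models
```

Because every decode module keys into the same cache by prompt, the aggregate prefill work grows with the number of distinct prompts rather than the number of models, which is the source of the latency and throughput gains the abstract reports.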