The performance of multi-turn, agentic LLM inference is increasingly dominated by KV-Cache storage I/O rather than computation. In prevalent disaggregated architectures, loading the massive KV-Cache from external storage creates a fundamental imbalance: storage NICs on prefill engines become bandwidth-saturated, while those on decode engines remain idle. This asymmetry severely constrains overall system throughput. We present DualPath, an inference system that breaks this bottleneck by introducing dual-path KV-Cache loading. Beyond the traditional storage-to-prefill path, DualPath enables a novel storage-to-decode path, in which the KV-Cache is loaded into decode engines and then efficiently transferred to prefill engines via RDMA over the compute network. DualPath combines this optimized data path, which inherently avoids network congestion and does not interfere with latency-critical model-execution traffic, with a global scheduler that dynamically balances load across prefill and decode engines. Our evaluation on three models with production agentic workloads shows that DualPath improves offline inference throughput by up to 1.87$\times$ on our in-house inference system, and improves online serving throughput by 1.96$\times$ on average without violating SLOs.
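To make the path-selection idea concrete, below is a minimal sketch of how a global scheduler might choose between the two loading paths based on storage-NIC utilization. This is not DualPath's actual implementation; the names (`Engine`, `choose_path`, `headroom`) and the single-threshold policy are hypothetical illustrations of the routing decision the abstract describes.

```python
# Hypothetical sketch of a dual-path load decision (not DualPath's real code).
# It illustrates routing a KV-Cache load to whichever side has spare
# storage-NIC bandwidth: directly to the prefill engine, or via a decode
# engine that then forwards the cache to the prefill engine over the
# compute network via RDMA.

from dataclasses import dataclass

@dataclass
class Engine:
    name: str
    storage_nic_util: float  # fraction of storage-NIC bandwidth in use, 0..1

def choose_path(prefill: Engine, decode: Engine, headroom: float = 0.9) -> str:
    """Pick a KV-Cache loading path for one request.

    Returns "storage->prefill" when the prefill engine's storage NIC has
    spare bandwidth, and "storage->decode->prefill" when it is saturated
    but the decode engine's storage NIC is not.
    """
    if prefill.storage_nic_util < headroom:
        return "storage->prefill"          # traditional direct path
    if decode.storage_nic_util < headroom:
        return "storage->decode->prefill"  # dual path: load on decode, RDMA to prefill
    return "storage->prefill"              # both saturated: fall back and queue

if __name__ == "__main__":
    p = Engine("prefill-0", storage_nic_util=0.97)  # saturated
    d = Engine("decode-0", storage_nic_util=0.10)   # mostly idle
    print(choose_path(p, d))  # -> storage->decode->prefill
```

A real scheduler would also weigh queue depths, transfer sizes, and compute-network contention, but the core asymmetry exploited is the same: idle storage NICs on decode engines can absorb load that would otherwise queue behind saturated prefill-side NICs.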