Distributed prefix caching has become a core technique for efficient LLM serving. For long-context requests with high cache hit ratios, however, retrieving reusable KVCache blocks from remote servers has emerged as a new performance bottleneck. Such network-intensive LLM inference is expected to become increasingly common as agentic AI workloads continue to grow. Existing LLM inference engines nonetheless remain largely compute-centric: they treat KVCache loading as a phase subordinate to GPU execution and often fail to account explicitly for its delay during scheduling. We present CALVO, an LLM serving engine that treats KVCache loading as a first-class concern. CALVO decouples KVCache loading and GPU computation into independently managed, asynchronously progressing stages, enabling better utilization of network, PCIe, and computation resources. In addition, CALVO incorporates KVCache loading delay as an explicit component of per-request service cost, leading to more accurate scheduling decisions. Experiments on a real testbed with diverse long-context workloads show that CALVO substantially improves the efficiency of network-intensive LLM inference, achieving up to 61.67% higher SLO attainment than the baseline.
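To make concrete the idea of counting KVCache loading delay as part of per-request service cost, the following is a minimal sketch and not CALVO's actual scheduler. The abstract does not give the cost formula or the scheduling policy, so everything here is an assumption for illustration: the linear cost model, the shortest-cost-first ordering, the `Request` fields, and all constants are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str
    cached_tokens: int  # prefix tokens whose KVCache must be fetched remotely
    new_tokens: int     # tokens that still need prefill on the GPU


# Hypothetical constants, chosen only to make the example concrete.
KV_BYTES_PER_TOKEN = 160 * 1024         # assumed KVCache footprint per token
NETWORK_BYTES_PER_SEC = 10 * 1024 ** 3  # assumed effective fetch bandwidth
PREFILL_SEC_PER_TOKEN = 0.0001          # assumed GPU prefill time per token


def load_delay(req: Request) -> float:
    """Estimated time to pull the reusable KVCache blocks over the network."""
    return req.cached_tokens * KV_BYTES_PER_TOKEN / NETWORK_BYTES_PER_SEC


def compute_cost(req: Request) -> float:
    """Compute-centric view: only the GPU prefill work counts."""
    return req.new_tokens * PREFILL_SEC_PER_TOKEN


def service_cost(req: Request) -> float:
    """Load-aware view: loading delay is an explicit cost component."""
    return load_delay(req) + compute_cost(req)


def order_by(requests, cost):
    """Shortest-cost-first ordering (an assumed policy, for illustration)."""
    return [r.name for r in sorted(requests, key=cost)]


requests = [
    Request("A", cached_tokens=100_000, new_tokens=500),  # high hit ratio
    Request("B", cached_tokens=0, new_tokens=4_000),      # no remote cache
]

print("compute-only order:", order_by(requests, compute_cost))
print("load-aware order:  ", order_by(requests, service_cost))
```

With these made-up numbers, the compute-only estimate ranks request A as the cheapest because it has little left to prefill, while the load-aware estimate ranks B first because A must wait on a large remote fetch. The sketch covers only the cost estimate; the decoupling of loading and computation into asynchronous stages is not modeled here.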