Coding agents are a fast-growing LLM application. They execute as long-running closed-loop sessions in which LLM generations alternate with external tool calls. Yet, unlike chat workloads, their serving behavior has not been studied extensively. We address this gap by collecting a dataset of real-world coding assistant traces. Our analysis shows that coding agent sessions repeatedly reuse large prefixes and create sustained KVCache pressure that conventional LLM serving policies handle poorly. Based on these findings, we present CacheWise, a KVCache management layer that improves KVCache reuse for coding agent workloads. CacheWise combines prefix-aware scheduling with reuse-aware eviction guided by lightweight predictions from tool call metadata. Implemented in vLLM and evaluated on the collected traces, CacheWise reduces KVCache evictions by 2-2.6x and improves total agent session completion time by up to 3.5x.
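To make the idea of reuse-aware eviction guided by tool call metadata concrete, the following is a minimal illustrative sketch, not CacheWise's actual policy. It assumes that each cached prefix belongs to a session currently waiting on a tool call, and that the tool's name gives a rough estimate of when the session will return and reuse its prefix. The names `CachedPrefix`, `EXPECTED_TOOL_SECONDS`, `predicted_reuse_time`, and `choose_victims`, as well as the per-tool durations, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical per-tool estimates (in seconds) of how long a tool call keeps
# a session idle. Illustrative placeholders, not values from the traces.
EXPECTED_TOOL_SECONDS = {"read_file": 0.1, "web_search": 5.0, "run_tests": 30.0}
DEFAULT_TOOL_SECONDS = 10.0  # assumed fallback for unknown tools


@dataclass
class CachedPrefix:
    session_id: str
    num_blocks: int         # KV cache blocks held by this session's prefix
    pending_tool: str       # tool the session is currently waiting on
    tool_started_at: float  # timestamp at which the tool call began


def predicted_reuse_time(entry: CachedPrefix) -> float:
    """Estimate when the session will issue its next LLM call."""
    duration = EXPECTED_TOOL_SECONDS.get(entry.pending_tool, DEFAULT_TOOL_SECONDS)
    return entry.tool_started_at + duration


def choose_victims(entries: list[CachedPrefix], blocks_needed: int) -> list[CachedPrefix]:
    """Pick prefixes to evict, furthest predicted reuse first, until enough blocks are freed."""
    victims: list[CachedPrefix] = []
    freed = 0
    for entry in sorted(entries, key=predicted_reuse_time, reverse=True):
        if freed >= blocks_needed:
            break
        victims.append(entry)
        freed += entry.num_blocks
    return victims
```

Under these assumptions, a session waiting on a long test run loses its cached prefix before a session waiting on a quick file read, because the latter is expected to reuse its prefix almost immediately. A plain LRU policy would not make this distinction if both prefixes were last touched at the same time.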