Multi-turn LLM agents interleave model calls with external tool invocations, shifting serving from stateless request processing to stateful program execution. Serving these workloads requires scheduling, KV-cache management, and routing policies that use program-level context, including turn dependencies, tool-induced gaps, and reusable KV state. Evaluating such policies directly on real systems is costly, since each design point may require dedicated accelerator time across arrival rates, model scales, serving-instance counts, and memory hierarchies. Simulation offers a scalable alternative, but existing LLM serving simulators target stateless request-level workloads and therefore omit the core dynamics of agent serving: multi-turn program execution, cross-turn cache locality, and KV-cache residency during tool gaps. We present AGENTSERVESIM, a hardware-aware simulator for multi-turn LLM agent serving. AGENTSERVESIM evaluates serving policies at program granularity through composable modules: a Program Orchestrator preserves program identity and turn order, a Tool Simulator materializes tool-induced gaps, a Session-Aware Router maintains program-to-instance affinity for cache-aware dispatch, and a KV Residency Model tracks policy-defined KV placement in HBM and host DRAM/CXL, as well as eviction. Across real serving deployments and hardware configurations, AGENTSERVESIM reproduces real-system behavior within 6% error on key performance metrics while running entirely on commodity CPUs. These results show that AGENTSERVESIM enables controlled, repeatable exploration of agent-serving policies without requiring exhaustive deployment on costly accelerators.
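To make the interaction between tool-induced gaps and KV-cache residency concrete, the following is a minimal illustrative sketch, not AGENTSERVESIM's implementation. It assumes a hypothetical time-to-live policy: KV state stays in HBM for short gaps, is demoted to host memory for medium gaps, and is evicted for long gaps. All names (`Tier`, `Turn`, `residency_after_gap`, `simulate_program`) and the threshold values are assumptions introduced for illustration. The sketch ignores decode tokens, transfer costs, and hardware timing.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    HBM = "hbm"
    HOST = "host"          # host DRAM/CXL
    EVICTED = "evicted"


@dataclass
class Turn:
    new_tokens: int        # tokens appended to the context in this turn
    tool_gap_s: float      # tool latency following this turn (0 for the last turn)


def residency_after_gap(gap_s: float, hbm_ttl_s: float, host_ttl_s: float) -> Tier:
    """Hypothetical TTL policy (an assumption, not taken from the paper)."""
    if gap_s <= hbm_ttl_s:
        return Tier.HBM
    if gap_s <= host_ttl_s:
        return Tier.HOST
    return Tier.EVICTED


def simulate_program(turns, hbm_ttl_s=1.0, host_ttl_s=10.0):
    """Return, per turn, (tier holding the prior KV state, tokens to prefill)."""
    context = 0            # tokens accumulated by earlier turns
    tier = Tier.EVICTED    # nothing is cached before the first turn
    trace = []
    for turn in turns:
        if tier is Tier.EVICTED:
            prefill = context + turn.new_tokens   # recompute the whole context
        else:
            prefill = turn.new_tokens             # reuse the cached prefix
        trace.append((tier, prefill))
        context += turn.new_tokens
        tier = residency_after_gap(turn.tool_gap_s, hbm_ttl_s, host_ttl_s)
    return trace


if __name__ == "__main__":
    program = [Turn(1000, 0.5), Turn(200, 5.0), Turn(300, 60.0), Turn(100, 0.0)]
    for tier, prefill in simulate_program(program):
        print(tier.name, prefill)
```

In this toy program, the 60-second tool gap after the third turn exceeds both thresholds, so the fourth turn must prefill the full 1,600-token context rather than only its 100 new tokens. This is the kind of cross-turn effect that a stateless request-level simulator cannot capture.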