Agentic AI shifts LLM serving from isolated prompt-generation requests to stateful, multi-turn executions that repeatedly invoke the model, call tools, and grow context over time. This paper characterizes ReAct-style agents from both the LLM-serving and tool-execution perspectives using an end-to-end tracing infrastructure across reasoning and non-reasoning Gemma and Qwen configurations on five agentic benchmarks. Our study shows that agentic workloads are not simply long-prompt workloads: with effective context caching, most input tokens are reused across turns, making execution decode-dominated while increasing dependence on long-lived KV-cache state. We also find that tool use has a clear temporal structure, with agents shifting from read/explore behavior early in execution to execute/write behavior later. These results show that efficient agentic serving must jointly manage repeated model re-entry, persistent context state, and workload-dependent tool behavior.