Agentic applications are LLMs that iteratively invoke external tools to accomplish complex tasks. Such tool-based agents are rapidly becoming the dominant paradigm for deploying language models in production. Unlike traditional single-turn inference, agentic workloads chain together multiple LLM calls and tool executions before producing a final response, creating a new performance bottleneck that manifests as increased First Token Rendered (FTR) latency for the final answer. Through analysis of requests at production scale, we reveal three critical challenges: tool calls account for 30-85% of FTR latency, KV cache hit rates collapse despite substantial context reuse across iterations, and sequential orchestration wastes potential intra-request parallelism. These bottlenecks stem from a design gap in which orchestrators and LLM engines operate as decoupled black boxes, preventing cross-layer optimizations. We present Sutradhara, a co-designed agentic inference system that integrates orchestration with LLM serving through a thin API. This API enables three optimizations: overlapping tool execution with subsequent LLM prefill through tool-aware prompt splitting, streaming tool execution that dispatches tools incrementally during decode rather than waiting for the complete output, and orchestrator-aware cache management that uses semantic hints to improve hit rates and reduce thrashing. Implemented on vLLM, Sutradhara improves the throughput-latency trade-off in agentic systems: it sustains up to 77% higher load at the same median FTR latency, or reduces median FTR latency by up to 15% at the same load while reducing end-to-end latency by up to 11% on A100 GPUs.
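To make the first optimization concrete, the following is a minimal sketch of overlapping tool execution with prefill. It is not Sutradhara's implementation or API: `run_tool`, `prefill`, `sequential`, `overlapped`, and `timed` are hypothetical stand-ins, both the tool and the prefill are simulated with `asyncio.sleep`, and the per-token prefill cost is an arbitrary assumption. The only idea it illustrates is that the part of the next prompt that does not depend on the tool's output can be prefilled while the tool is still running.

```python
import asyncio
import time


async def run_tool(name: str, seconds: float) -> str:
    """Hypothetical stand-in for an external tool call."""
    await asyncio.sleep(seconds)
    return f"<result of {name}>"


async def prefill(tokens: list[str], seconds_per_token: float = 0.01) -> int:
    """Hypothetical stand-in for LLM prefill; returns the token count processed."""
    await asyncio.sleep(seconds_per_token * len(tokens))
    return len(tokens)


async def sequential(prefix: list[str], tool_name: str, tool_seconds: float) -> int:
    """Baseline: wait for the tool, then prefill the whole prompt."""
    tool_output = await run_tool(tool_name, tool_seconds)
    return await prefill(prefix + tool_output.split())


async def overlapped(prefix: list[str], tool_name: str, tool_seconds: float) -> int:
    """Split the prompt: prefill the tool-independent prefix while the tool runs."""
    tool_task = asyncio.create_task(run_tool(tool_name, tool_seconds))
    done_prefix = await prefill(prefix)               # runs concurrently with the tool
    tool_output = await tool_task                     # wait for the tool if still running
    done_suffix = await prefill(tool_output.split())  # only the new tokens remain
    return done_prefix + done_suffix


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    prefix = ["tok"] * 20
    for fn in (sequential, overlapped):
        tokens, elapsed = timed(fn(prefix, "search", 0.2))
        print(f"{fn.__name__}: {tokens} tokens prefilled in {elapsed:.2f}s")
```

Both paths prefill the same tokens; in this simulation the overlapped path finishes sooner because the prefix prefill is hidden behind the tool's latency. A real engine would additionally need the prefix's KV cache to remain resident until the tool-dependent suffix arrives, which this sketch does not model.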