Agentic applications are LLMs that iteratively invoke external tools to accomplish complex tasks. Such tool-based agents are rapidly becoming the dominant paradigm for deploying language models in production. Unlike traditional single-turn inference, agentic workloads chain together multiple LLM calls and tool executions before producing a final response, creating a new performance bottleneck that manifests as increased First Token Rendered (FTR) latency for the final answer. Through analysis of requests at production scale, we reveal three critical challenges: tool calls account for 30-85% of FTR latency, KV cache hit rates collapse despite substantial context reuse across iterations, and sequential orchestration wastes potential intra-request parallelism. These bottlenecks stem from a design gap in which orchestrators and LLM engines operate as decoupled black boxes, preventing cross-layer optimizations. We present Sutradhara, a co-designed agentic inference system that integrates orchestration with LLM serving through a thin API enabling three optimizations: overlapping tool execution with subsequent LLM prefill using tool-aware prompt splitting; streaming tool execution, which dispatches tools incrementally during decode rather than waiting for the complete output; and orchestrator-aware cache management, which uses semantic hints to improve hit rates and reduce thrashing. Implemented on vLLM, Sutradhara improves the throughput-latency trade-off in agentic systems: it sustains up to 77% higher load at the same median FTR latency, or reduces median FTR latency by up to 15% at the same load while reducing end-to-end latency by up to 11% on A100 GPUs.
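To make the streaming tool execution idea concrete, here is a minimal sketch of dispatching each tool call as soon as its text is complete in the decode stream, rather than after decoding ends. It is not Sutradhara's implementation or API: `fake_decode`, `fake_tool`, `stream_and_dispatch`, and the newline-delimited JSON tool-call format are all hypothetical names and assumptions chosen for illustration.

```python
import asyncio
import json

# Hypothetical decode output: two tool calls, split across arbitrary chunks.
CHUNKS = [
    '{"name": "search", "args":',
    ' "kv cache"}\n{"name": ',
    '"calc", "args": "1+1"}\n',
]


async def fake_decode(chunks, delay=0.01):
    """Stand-in for an LLM decode stream (assumption): yields text chunks."""
    for chunk in chunks:
        await asyncio.sleep(delay)  # simulated per-chunk decode time
        yield chunk


async def fake_tool(call, log):
    """Stand-in for a tool invocation (assumption): logs start and finish."""
    log.append(("start", call["name"]))
    await asyncio.sleep(0.02)  # simulated tool latency
    log.append(("done", call["name"]))
    return {"name": call["name"], "result": f"ok:{call['args']}"}


async def stream_and_dispatch(chunks, log):
    """Launch each tool call once its JSON line is complete in the stream."""
    buffer = ""
    tasks = []
    async for chunk in fake_decode(chunks):
        buffer += chunk
        # Assumed format: one JSON object per newline-terminated line.
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line.strip():
                call = json.loads(line)
                # The task runs concurrently with the remaining decode steps.
                tasks.append(asyncio.create_task(fake_tool(call, log)))
    log.append(("decode_finished", None))
    # Results come back in dispatch order.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    events = []
    print(asyncio.run(stream_and_dispatch(CHUNKS, events)))
    print(events)
```

In this sketch the first tool starts while later chunks are still being decoded, so its latency overlaps with decode instead of adding to it. A real system would also need to handle malformed or partial calls and tool failures, which are omitted here.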