Agentic applications are LLM-driven systems that iteratively invoke external tools to accomplish complex tasks. Such tool-based agents are rapidly becoming the dominant paradigm for deploying language models in production. Unlike traditional single-turn inference, agentic workloads chain multiple LLM calls and tool executions before producing a final response, creating a new performance bottleneck that manifests as increased First-Token-Rendered (FTR) latency for the final answer. Through analysis of synthetic requests at production scale, we identify three critical challenges: tool calls account for 30-80% of FTR latency; KV cache hit rates collapse despite substantial context reuse across iterations; and sequential orchestration forfeits intra-request parallelism by executing LLM calls and tools one at a time. These bottlenecks stem from a design gap in which orchestrators and LLM engines operate as decoupled black boxes, precluding cross-layer optimizations. We present SUTRADHARA, a co-designed agentic inference system that integrates orchestration with LLM serving through a thin API enabling three optimizations: tool-aware prompt splitting, which overlaps tool execution with the subsequent LLM prefill; streaming tool execution, which dispatches tools incrementally during decode rather than waiting for complete output; and orchestrator-aware cache management, which uses semantic hints to improve hit rates and reduce thrashing. Implemented on vLLM, SUTRADHARA reduces median FTR latency by 15% and end-to-end latency by 10% across workloads on A100 GPUs, demonstrating that co-design can systematically tame latency in agentic systems.
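To make the streaming-tool-execution idea concrete, the following is a minimal sketch, not SUTRADHARA's actual implementation: it scans the decoded token stream for tool-call JSON and dispatches each call on a worker thread the moment its object closes, overlapping tool latency with the remaining decode. The function names (`dispatch_streaming`, `run_tool`) and the brace-matching parser are illustrative assumptions; a real engine would hook the serving loop and handle braces inside string literals.

```python
# Illustrative sketch of streaming tool execution (hypothetical API, not SUTRADHARA's).
import json
from concurrent.futures import ThreadPoolExecutor


def extract_complete_calls(text: str):
    """Yield (call_dict, end_index) for each complete top-level JSON object,
    found by naive brace matching (simplification: ignores braces in strings)."""
    depth, start = 0, None
    for i, ch in enumerate(text):
        if ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                try:
                    yield json.loads(text[start:i + 1]), i + 1
                except json.JSONDecodeError:
                    pass  # not a tool call; skip this object
                start = None


def dispatch_streaming(token_stream, run_tool):
    """Consume decoded tokens; launch each tool call as soon as it is complete,
    instead of waiting for the full model output."""
    buffer, consumed, futures = "", 0, []
    with ThreadPoolExecutor() as pool:
        for token in token_stream:
            buffer += token
            # Scan only the unconsumed suffix for newly completed calls.
            calls = list(extract_complete_calls(buffer[consumed:]))
            for call, _ in calls:
                futures.append(pool.submit(run_tool, call))
            if calls:
                consumed += calls[-1][1]  # last end index covers all yielded calls
        return [f.result() for f in futures]
```

Under this sketch, a call whose JSON closes early in the decode starts running while later tokens are still being generated, which is the source of the overlap the abstract describes.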