LLM-powered agents are emerging as a dominant paradigm for autonomous task solving. Unlike standard inference workloads, agents operate in a strictly serial "LLM-tool" loop, in which the LLM must wait for external tool execution at every step. This execution model introduces severe latency bottlenecks. To address this problem, we propose PASTE, a Pattern-Aware Speculative Tool Execution method designed to hide tool latency through speculation. PASTE is based on the insight that, although agent requests are semantically diverse, they exhibit stable application-level control flows (recurring tool-call sequences) and predictable data dependencies (parameter passing between tools). By exploiting these two properties, PASTE executes tools speculatively and thereby improves agent serving performance. In experiments against state-of-the-art baselines, PASTE reduces average task completion time by 48.5% and improves tool execution throughput by 1.8x.
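To make the idea of pattern-aware speculation concrete, the following is a minimal illustrative sketch and not PASTE's actual implementation. It assumes a first-order (bigram) model over past tool-call traces as the "pattern", and it assumes the simplest possible data dependency: the next tool's argument is the previous tool's output. The names `ToolPatternPredictor`, `SpeculativeRunner`, and `min_confidence` are hypothetical and introduced only for illustration.

```python
from collections import Counter, defaultdict
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple


class ToolPatternPredictor:
    """Hypothetical predictor: learns which tool tends to follow which."""

    def __init__(self, min_confidence: float = 0.6) -> None:
        self.transitions: Dict[str, Counter] = defaultdict(Counter)
        self.min_confidence = min_confidence

    def observe(self, trace: List[str]) -> None:
        # Count consecutive tool pairs in one recorded agent run.
        for prev, nxt in zip(trace, trace[1:]):
            self.transitions[prev][nxt] += 1

    def predict(self, last_tool: str) -> Optional[str]:
        # Return the most frequent successor, but only if it is likely enough.
        counts = self.transitions.get(last_tool)
        if not counts:
            return None
        tool, hits = counts.most_common(1)[0]
        if hits / sum(counts.values()) >= self.min_confidence:
            return tool
        return None


class SpeculativeRunner:
    """Hypothetical runner: starts the predicted tool while the LLM decodes."""

    def __init__(
        self,
        tools: Dict[str, Callable[[str], str]],
        predictor: ToolPatternPredictor,
    ) -> None:
        self.tools = tools
        self.predictor = predictor
        self.pool = ThreadPoolExecutor(max_workers=2)
        self.pending: Optional[Tuple[str, str, Future]] = None

    def speculate(self, last_tool: str, last_output: str) -> None:
        # Assumption: the predicted tool consumes the previous tool's output.
        guess = self.predictor.predict(last_tool)
        if guess is not None and guess in self.tools:
            future = self.pool.submit(self.tools[guess], last_output)
            self.pending = (guess, last_output, future)

    def call(self, tool: str, arg: str) -> str:
        # Invoked once the LLM has actually chosen a tool and an argument.
        pending, self.pending = self.pending, None
        if pending is not None:
            name, spec_arg, future = pending
            if (name, spec_arg) == (tool, arg):
                return future.result()  # hit: reuse the speculative result
            future.cancel()  # miss: best-effort cancel, then run normally
        return self.tools[tool](arg)
```

In this sketch, `speculate` would be called as soon as a tool returns, so the predicted tool runs concurrently with LLM decoding; `call` then either reuses the speculative result or falls back to normal execution. Speculating this way is only safe for tools without side effects, since a mispredicted call may still run to completion.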