LLM-powered agents execute tasks through a sequential loop of model generation and tool execution. Today's serving systems serialize this loop, leaving tool latency exposed on the task critical path. This paper presents PASTE, a tool-aware agent-serving system that predicts concrete future tool invocations from recurring agent patterns and executes them speculatively while the LLM is still generating. PASTE isolates speculative results until confirmed by the LLM and jointly schedules tool execution and returning LLM sessions to avoid shifting bottlenecks to the GPU. Across deep research, coding, and scientific-agent workloads, PASTE reduces average task completion time by 43.5% and lowers observed tool latency by 1.8x.
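The speculate-then-confirm idea described above can be illustrated with a minimal sketch. This is not PASTE's implementation: the class `SpeculativeToolRunner`, its methods, and the `(name, args)` matching key are hypothetical names chosen for illustration, the prediction step is left to the caller, and the sketch assumes tools are read-only so that running an unconfirmed call has no visible side effects. It also omits the joint scheduling of tool execution and LLM sessions.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict, Tuple

# Hypothetical key type: a tool name plus its positional arguments (must be hashable).
ToolCall = Tuple[str, Tuple[Any, ...]]


class SpeculativeToolRunner:
    """Illustrative only: starts predicted tool calls early and holds results back."""

    def __init__(self, tools: Dict[str, Callable[..., Any]], max_workers: int = 4):
        self._tools = tools
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        # Speculative results stay here, invisible to the agent, until confirmed.
        self._pending: Dict[ToolCall, Future] = {}
        self.hits = 0
        self.misses = 0

    def speculate(self, name: str, *args: Any) -> None:
        """Start a predicted call in the background while the model is generating."""
        key = (name, args)
        if key not in self._pending:
            self._pending[key] = self._pool.submit(self._tools[name], *args)

    def resolve(self, name: str, *args: Any) -> Any:
        """Serve the call the model actually emitted, reusing a matching speculation."""
        future = self._pending.pop((name, args), None)
        if future is not None:
            self.hits += 1
            return future.result()  # waits only for whatever work remains
        self.misses += 1
        return self._tools[name](*args)  # no match: run on the critical path

    def discard_unconfirmed(self) -> None:
        """Drop speculations the model never confirmed (for example, at end of a turn)."""
        for future in self._pending.values():
            future.cancel()  # best effort; already-running calls finish but are ignored
        self._pending.clear()

    def close(self) -> None:
        self.discard_unconfirmed()
        self._pool.shutdown(wait=True)
```

In this sketch a caller would invoke `speculate` for each predicted call as soon as a recurring pattern is recognized, call `resolve` when the model emits its real tool call, and call `discard_unconfirmed` once the turn ends. Matching on exact name and arguments is the simplest possible confirmation rule and is an assumption of the sketch, not a claim about how PASTE matches predictions.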