LLM-powered agents execute tasks through a sequential loop of model generation and tool execution. Today's serving systems serialize this loop, leaving tool latency exposed on the task critical path. This paper presents PASTE, a tool-aware agent-serving system that predicts concrete future tool invocations from recurring agent patterns and executes them speculatively while the LLM is still generating. PASTE isolates speculative results until confirmed by the LLM and jointly schedules tool execution and returning LLM sessions to avoid shifting bottlenecks to the GPU. Across deep research, coding, and scientific-agent workloads, PASTE reduces average task completion time by 43.5% and lowers observed tool latency by 1.8x.
翻译:大语言模型驱动的智能体通过模型生成与工具执行的顺序循环完成任务。当前的服务系统将该循环串行化,导致工具延迟暴露在任务关键路径上。本文提出PASTE系统——一种工具感知的智能体服务系统,能够从重复出现的智能体模式中预测具体未来工具调用,并在语言模型仍在生成时投机执行这些调用。PASTE将投机结果隔离存储至获得语言模型确认,并联合调度工具执行与返回中的大语言模型会话,以避免瓶颈转移至GPU。在深度研究、编程及科学智能体工作负载中,PASTE将平均任务完成时间降低43.5%,并将观测到的工具延迟降低至1.8倍。