AI agents are increasingly deployed in complex, interactive environments, yet their runtime remains a major bottleneck for training, evaluation, and real-world use. Typical agent behavior unfolds sequentially, with each action requiring an API call that can incur substantial latency. For example, a game of chess between two state-of-the-art agents can take hours. We introduce Speculative Actions, a lossless acceleration framework for general agentic systems. Inspired by speculative execution in microprocessors and speculative decoding in LLM inference, our method uses faster models to predict likely future actions and execute them in parallel, committing only when predictions match. We evaluate speculative actions across gaming, e-commerce, and web search environments, and additionally study a lossy extension in an operating systems setting. Across domains, we achieve up to 55% next-action prediction accuracy, translating into up to 20% latency reductions. Finally, we present a cost-latency analysis that formalizes the tradeoff between speculative breadth and time savings. This analysis enables principled tuning and selective branch launching to ensure that multi-branch speculation delivers practical speedups without prohibitive cost growth.
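The predict, pre-execute, and commit-on-match loop described in the abstract can be illustrated with a minimal single-branch sketch. This is not the authors' implementation: `slow_policy`, `fast_policy`, `env_step`, `speculative_step`, and `run_episode` are hypothetical names, and the sketch assumes that `env_step` is a pure function of the state and action, so a wrong speculative result can simply be discarded.

```python
from concurrent.futures import ThreadPoolExecutor


def speculative_step(state, slow_policy, fast_policy, env_step, pool):
    """Run one agent step with single-branch speculation.

    Returns (action, next_state, hit). The committed action always comes
    from slow_policy, so the trajectory matches sequential execution.
    """
    # Start the authoritative (slow) policy in the background.
    slow_future = pool.submit(slow_policy, state)
    # While it runs, guess the action cheaply and pre-execute the guess.
    guess = fast_policy(state)
    spec_future = pool.submit(env_step, state, guess)
    # Wait for the authoritative decision.
    action = slow_future.result()
    if action == guess:
        # Hit: commit the result that was computed speculatively.
        return action, spec_future.result(), True
    # Miss: drop the speculative result and execute the real action.
    spec_future.cancel()  # best effort; a no-op if the task already started
    return action, env_step(state, action), False


def run_episode(state, slow_policy, fast_policy, env_step, steps):
    """Run `steps` speculative steps; return the trajectory and hit count."""
    trajectory, hits = [], 0
    with ThreadPoolExecutor(max_workers=2) as pool:
        for _ in range(steps):
            action, state, hit = speculative_step(
                state, slow_policy, fast_policy, env_step, pool
            )
            trajectory.append((action, state))
            hits += hit
    return trajectory, hits


# Toy stand-ins (assumptions for illustration only).
def toy_slow_policy(state):
    return state % 3


def toy_fast_policy(state):
    # Agrees with the slow policy on even states, guesses 0 otherwise.
    return state % 3 if state % 2 == 0 else 0


def toy_env_step(state, action):
    return state + action + 1


if __name__ == "__main__":
    print(run_episode(0, toy_slow_policy, toy_fast_policy, toy_env_step, 5))
```

In this sketch a correct guess hides the transition cost behind the slow policy's latency, while a wrong guess costs only the wasted speculative work. That wasted work is the quantity a cost-latency analysis would weigh when deciding how many branches to launch.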