Cost-Aware Speculative Execution for LLM-Agent Workflows: An Integrated Five-Dimension Method

LLM-agent workflows chain model calls and tool invocations, and spend most of their wall-clock time waiting on upstream operations before downstream ones can start. Speculative execution can reclaim that idle time by launching a downstream operation with a predicted upstream input, but here each speculation costs real money (per-token billing) and its success probability is hard to estimate and drifts over time. This paper presents a method organized around five design decisions: (D1) start a downstream operation before its upstream completes; (D2) price each speculation in real dollars at separate input and output rates; (D3) expose a single operator dial for latency versus cost; (D4) decide via an expected-value rule with a failure-weighted cost term and a preference-adjusted threshold; and (D5) estimate the success probability with a Bayesian Beta-Binomial posterior whose prior is keyed to a dependency-type taxonomy. Variants of these ideas appear in recent work; the combination, with every decision logged in dollars, is what is new. The rule fires only on edges passing an admissibility precondition (side-effect-free, idempotent, or stageable behind a commit barrier), since a wrong speculation is rolled back by re-execution, which refunds tokens but cannot un-send an irreversible side effect. We specify the runtime mechanics, a closed-form result that the rule self-limits as the upstream branching factor grows, a five-stage calibration pipeline (offline replay, shadow, canary, online calibration, drift-triggered kill-switch), and a workload-fit rubric over eight production archetypes. Contrast tables against the four closest published systems (DSP, Speculative Actions v2, Sherlock, B-PASTE) show differentiators on every dimension, and a synthetic validation suite confirms the predicted decision boundary, probability threshold, posterior recovery, and streaming-cancellation behavior.
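
As a minimal sketch of how D2, D4, and D5 could fit together, the Python below pairs a Beta-Binomial posterior with a dollar-denominated expected-value test. The abstract does not give the rule's exact form, so everything here is an illustrative assumption: the names `BetaPosterior`, `speculation_cost_usd`, and `should_speculate`, the use of a `usd_per_second` parameter as the latency-versus-cost dial, and the specific inequality.

```python
from dataclasses import dataclass


@dataclass
class BetaPosterior:
    """Beta(alpha, beta) belief over a speculation's success probability.

    Hypothetical helper: the prior pseudo-counts would come from the
    dependency-type taxonomy, which is not reproduced here.
    """
    alpha: float  # pseudo-count of successes (prior + observed)
    beta: float   # pseudo-count of failures (prior + observed)

    def update(self, success: bool) -> None:
        # Conjugate Beta-Binomial update: one observation adds one count.
        if success:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        # Posterior mean, used below as the point estimate of success.
        return self.alpha / (self.alpha + self.beta)


def speculation_cost_usd(input_tokens: int, output_tokens: int,
                         input_rate: float, output_rate: float) -> float:
    """Dollar cost of one speculation, with separate per-token rates."""
    return input_tokens * input_rate + output_tokens * output_rate


def should_speculate(p: float, saved_seconds: float,
                     usd_per_second: float, cost_usd: float) -> bool:
    """Assumed rule: speculate when the success-weighted latency value
    exceeds the failure-weighted dollar cost.

    usd_per_second is a stand-in for the operator dial: raising it
    favors latency, lowering it favors cost.
    """
    expected_gain = p * saved_seconds * usd_per_second
    expected_waste = (1.0 - p) * cost_usd
    return expected_gain > expected_waste
```

For instance, starting from a `BetaPosterior(2.0, 2.0)` prior and observing three successes and one failure gives a posterior mean of 5/8. Under this sketch, a speculation that saves two seconds is admitted when the dial values a second at $0.01 and rejected when it values a second at $0.001, given a cost of roughly one cent.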
