Goal-Autopilot: A Verifiable Anti-Fabrication Firewall for Unattended Long-Horizon Agents

Long-horizon LLM agents are not trusted to run unattended: with no human watching, they confidently report success they never verified. We treat honesty -- bounding what an agent may claim at termination -- as a first-class metric for unattended autonomy, distinct from capability. We present Autopilot, an execution model that makes silent fabricated success structurally impossible rather than merely rarer. Autopilot externalizes all working state into a durable, gated finite-state machine that a scheduler advances one stateless tick at a time; a hard floor forbids any terminal "done" claim whose falsifiable gate did not actually execute and pass. We prove a No-False-Success theorem -- under gate soundness, floor enforcement, and plan coverage, termination implies the goal holds -- whose only trust points are empirically measurable, and show the worst case degrades to an honest stall, never a fabricated success. Because each tick rehydrates only the state machine, per-step context cost is constant in the horizon. Across a 3,150-cell paired corpus (70 tasks $\times$ 3 systems $\times$ 3 models $\times$ 5 seeds, including 50 SWE-bench Lite tasks across 11 OSS repos), Autopilot fabricates on 0.95% of cells [95% CI 0.38--1.62] while Reflexion and StateFlow baselines fabricate on 8.10% [6.48--9.81] and 25.05% [22.48--27.62] respectively. The headline contrast lives in the hard regime: on SWE-bench Lite, the firewall reduces fabrication from 33.7% (StateFlow) to 0.67%, a paired difference of $-33.07$ pp [95% CI $-36.53, -29.73$]. The mechanism is the gate, not the model: all ten Autopilot fabrications come from the strongest model, while two weaker mid-tier models never fabricate across 700 paired cells. The firewall trades coverage for honesty by design -- an honest stall is recoverable; a confident wrong output shipped downstream is not.
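To make the execution model concrete, the following is a minimal sketch of a gated state machine advanced by stateless ticks, not the paper's implementation. All names (`new_state`, `tick`, `run`, the JSON layout, the `max_attempts` retry budget) are hypothetical choices made for illustration. The only properties it is meant to show are the ones stated above: every tick rehydrates its working state from a serialized blob, a step counts as passed only if its gate actually executed and returned true, and the terminal "done" status is unreachable unless every planned step has a recorded gate pass. Anything else ends in an honest "stalled".

```python
import json
from typing import Callable, Dict, List

Gate = Callable[[], bool]    # falsifiable check for one step (hypothetical interface)
Action = Callable[[], None]  # one unit of agent work, standing in for an LLM call


def new_state(steps: List[str]) -> str:
    """Serialize a fresh plan; this blob is the only state carried between ticks."""
    if not steps:
        raise ValueError("plan must contain at least one step")
    return json.dumps({"steps": list(steps), "cursor": 0, "passed": [],
                       "attempts": 0, "status": "running"})


def tick(blob: str, actions: Dict[str, Action], gates: Dict[str, Gate],
         max_attempts: int = 3) -> str:
    """Advance the machine by one stateless tick and return the new blob."""
    s = json.loads(blob)  # rehydrate; nothing else survives from earlier ticks
    if s["status"] != "running":
        return blob
    step = s["steps"][s["cursor"]]
    try:
        action = actions.get(step)
        if action is not None:
            action()
        gate = gates.get(step)
        # A missing gate counts as a failure: no gate execution, no pass.
        ok = gate is not None and gate() is True
    except Exception:
        ok = False  # a crashing action or gate is treated as a failed gate
    if ok:
        s["passed"].append(step)
        s["cursor"] += 1
        s["attempts"] = 0
        if s["cursor"] == len(s["steps"]):
            # Hard floor: "done" requires a recorded gate pass for every step.
            s["status"] = "done" if s["passed"] == s["steps"] else "stalled"
    else:
        s["attempts"] += 1
        if s["attempts"] >= max_attempts:
            s["status"] = "stalled"  # honest stall instead of a success claim
    return json.dumps(s)


def run(blob: str, actions: Dict[str, Action], gates: Dict[str, Gate],
        max_ticks: int = 100) -> str:
    """Toy scheduler: tick until the machine leaves the running state."""
    for _ in range(max_ticks):
        blob = tick(blob, actions, gates)
        if json.loads(blob)["status"] != "running":
            break
    return json.loads(blob)["status"]
```

In this sketch an action that does nothing while its gate keeps failing exhausts the retry budget and yields "stalled", never "done". The size of the blob depends on the plan rather than on how many ticks have run, which is the sense in which per-step context cost stays constant in the horizon. Whether a "done" result means the goal actually holds still depends on the gates being sound and the plan covering the goal, which the sketch does not check.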
