Harnesses are now central to coding-agent performance, mediating how models interact with tools and execution environments. Yet harness engineering remains a manual craft, because automating it faces three obstacles: a heterogeneous action space across editable components, voluminous trajectories that bury actionable signal, and edits whose effects are hard to attribute. We introduce Agentic Harness Engineering (AHE), a closed loop that addresses each obstacle with a matching observability pillar: (1) component observability gives every editable harness component a file-level representation, so the action space is explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction, which is later verified against the next round's task-level outcomes. Together, these pillars turn every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial and error. Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness transfers without re-evolution: on SWE-bench Verified it attains the highest aggregate success while using 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields gains of +5.1 to +10.1 pp across three alternate model families, indicating that the evolved components encode general engineering experience rather than benchmark-specific tuning. Ablations localize the gain to tools, middleware, and long-term memory rather than the system prompt, suggesting that factual harness structure transfers while prose-level strategy does not.
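To make the "falsifiable contract" idea concrete, the following is a minimal sketch of how an edit could be paired with a self-declared prediction and then checked against the next round's task-level outcomes. It is an illustration under stated assumptions, not the AHE implementation: the `EditContract` schema, the `verify` function, the task names, and the keep-or-revert rule are all hypothetical and do not come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class EditContract:
    """One harness edit plus the prediction declared with it (hypothetical schema)."""
    component_path: str                 # file-level handle of the edited component
    rationale: str                      # why the evolving agent made the edit
    predicted_fixes: Set[str]           # tasks expected to flip from fail to pass
    predicted_regressions: Set[str] = field(default_factory=set)


def verify(contract: EditContract,
           before: Dict[str, bool],
           after: Dict[str, bool]) -> dict:
    """Check the declared prediction against task-level pass/fail outcomes.

    `before` and `after` map task id -> passed, for the rounds before and
    after the edit was applied.
    """
    fixed = {t for t, ok in after.items() if ok and not before.get(t, False)}
    regressed = {t for t, ok in after.items() if not ok and before.get(t, False)}

    confirmed = contract.predicted_fixes & fixed
    missed = contract.predicted_fixes - fixed
    unexpected = regressed - contract.predicted_regressions

    # Assumed decision rule: keep the edit only if at least one predicted fix
    # materialised and no unpredicted regression appeared; otherwise revert.
    keep = bool(confirmed) and not unexpected
    return {
        "confirmed": sorted(confirmed),
        "missed": sorted(missed),
        "unexpected_regressions": sorted(unexpected),
        "keep": keep,
    }


if __name__ == "__main__":
    contract = EditContract(
        component_path="tools/shell_tool.py",   # illustrative path only
        rationale="retry on transient timeouts",
        predicted_fixes={"task-a", "task-c"},
    )
    before = {"task-a": False, "task-b": True, "task-c": False}
    after = {"task-a": True, "task-b": True, "task-c": False}
    print(verify(contract, before, after))
```

In this sketch the prediction is partially confirmed (`task-a` flips to passing, `task-c` does not), so the edit is kept while the missed prediction remains available as evidence for the next round. Because each contract names a single file-level component, a falsified edit can in principle be reverted by restoring that one file.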