Harnesses have become a central determinant of coding-agent performance, shaping how models interact with repositories, tools, and execution environments. Yet automating harness engineering is hard: the action space is heterogeneous, the evaluation signal is sparse and noisy, trajectories run to millions of tokens, and the effect of any single edit is hard to attribute from the next round's outcomes. We introduce Agentic Harness Engineering (AHE), a framework that automates harness-level evolution by instrumenting the three stages of any engineering loop (component editing, trajectory inspection, and decision making) with matched observability pillars: (1) component observability gives every editable harness component a file-level representation, so the action space is explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction that is later verified against the next round's task-level outcomes. Together, these pillars turn every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial and error. Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness transfers without re-evolution: on SWE-bench-verified it achieves the highest aggregate success while using 12% fewer tokens than the seed, and on Terminal-Bench 2 it yields gains of +5.1 to +10.1 pp across three alternate model families, indicating that the evolved components encode general engineering experience rather than benchmark-specific tuning. These results position observability-driven evolution as a practical pathway to keeping coding-agent harnesses continually improving.
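The "falsifiable contract" behind decision observability can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not AHE's actual implementation: the `EditPrediction` record, the `verify` function, the task identifiers, the file path, and especially the keep-or-revert rule are all hypothetical. The sketch only shows the general idea of pairing an edit with a declared prediction and checking it against task-level outcomes from two consecutive rounds.

```python
from dataclasses import dataclass, field


@dataclass
class EditPrediction:
    """Hypothetical record pairing one harness edit with its declared prediction."""
    edit_id: str
    component_path: str  # file-level representation of the edited component
    expected_fixes: set[str] = field(default_factory=set)  # tasks predicted to flip fail -> pass


def verify(pred: EditPrediction,
           before: dict[str, bool],
           after: dict[str, bool]) -> dict:
    """Check a prediction against task-level pass/fail outcomes of two rounds."""
    # Tasks that failed (or were absent) before and pass now.
    fixed = {t for t, ok in after.items() if ok and not before.get(t, False)}
    # Tasks that passed before and fail now.
    regressed = {t for t, ok in after.items() if not ok and before.get(t, False)}
    confirmed = pred.expected_fixes & fixed
    missed = pred.expected_fixes - fixed
    return {
        "confirmed": confirmed,
        "missed": missed,
        "regressed": regressed,
        # Assumed decision rule: keep the edit only if at least one predicted
        # fix materialized and nothing regressed. A real system facing noisy
        # evaluation would likely need repeated runs before deciding.
        "keep_edit": bool(confirmed) and not regressed,
    }


# Usage with made-up task outcomes.
before = {"task-a": False, "task-b": True, "task-c": False}
after = {"task-a": True, "task-b": True, "task-c": False}
pred = EditPrediction("edit-001", "prompts/system.md", {"task-a", "task-c"})
report = verify(pred, before, after)
```

Because each edit maps to a single file path, an edit whose prediction is not borne out could be reverted at the file level, which is where this sketch would connect to component observability.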