Harnesses have become a central determinant of coding-agent performance, shaping how models interact with repositories, tools, and execution environments. Yet automating harness engineering is hard: the action space is heterogeneous, the evaluation signal is sparse and noisy, trajectories run to millions of tokens, and the effect of an individual edit is hard to attribute from the next round's outcomes. We introduce Agentic Harness Engineering (AHE), a framework that automates harness-level evolution by instrumenting the three stages of any engineering loop (component editing, trajectory inspection, and decision making) with matched observability pillars: (1) component observability gives every editable harness component a file-level representation so the action space is explicit and revertible; (2) experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus that an evolving agent can actually consume; and (3) decision observability pairs every edit with a self-declared prediction, later verified against the next round's task-level outcomes. Together, these pillars turn every edit into a falsifiable contract, so harness evolution proceeds autonomously without collapsing into trial-and-error. Empirically, ten AHE iterations lift pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed harness Codex-CLI (71.9%) and the self-evolving baselines ACE and TF-GRPO. The frozen harness transfers without re-evolution: on SWE-bench Verified it achieves the highest aggregate success while using 12% fewer tokens than the seed harness, and on Terminal-Bench 2 it yields gains of +5.1 to +10.1 pp across three alternate model families, indicating that the evolved components encode general engineering experience rather than benchmark-specific tuning. These results position observability-driven evolution as a practical path to continually improving coding-agent harnesses.
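To make the "falsifiable contract" of decision observability concrete, the following is a minimal sketch, not the AHE implementation. It assumes an edit declares which tasks it expects to flip from fail to pass and is then checked against task-level results from two consecutive rounds. The names `EditPrediction` and `verify`, the example path `prompts/system.md`, and the keep rule (at least one confirmed fix and no regressions) are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditPrediction:
    """A harness edit paired with its self-declared prediction (hypothetical schema)."""
    edit_id: str
    component_path: str        # file-level representation of the edited component
    expected_fixed: frozenset  # task ids predicted to flip from fail to pass


def verify(prediction, before, after):
    """Check a prediction against task-level pass/fail results of two rounds.

    `before` and `after` map task id -> bool (True means the task passed).
    """
    fixed = {t for t in after if after[t] and not before.get(t, False)}
    regressed = {t for t in after if not after[t] and before.get(t, False)}
    confirmed = prediction.expected_fixed & fixed
    return {
        "confirmed": sorted(confirmed),
        "missed": sorted(prediction.expected_fixed - fixed),
        "regressed": sorted(regressed),
        # Assumed keep rule: some predicted fix materialized and nothing regressed.
        "keep": bool(confirmed) and not regressed,
    }


before = {"t1": False, "t2": True, "t3": False}
after = {"t1": True, "t2": True, "t3": False}
pred = EditPrediction("e1", "prompts/system.md", frozenset({"t1", "t3"}))
result = verify(pred, before, after)
print(result)
```

In this toy run the edit predicted fixes for `t1` and `t3`; only `t1` flipped, so the prediction is partly confirmed, `t3` is recorded as missed, and the edit is kept under the assumed rule. Because the component is addressed by a file path, an edit that fails verification could be reverted at that path.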