AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for adapting to new tasks. However, existing optimization methods typically require ground-truth validation sets, and such labeled data is difficult to acquire in practical deployment settings. To address this problem, we introduce Retrospective Harness Optimization (RHO), a self-supervised method that optimizes the agent harness using only past trajectories. Specifically, RHO selects a diverse coreset of challenging tasks from past trajectories and re-solves them in parallel. The agent analyzes these rollouts using self-validation and self-consistency, then generates candidate harness updates and selects the most effective one through pairwise self-preference. We evaluate RHO across three domains: software engineering, technical work, and knowledge work. Notably, a single optimization round improves the pass rate on SWE-Bench Pro from 59% to 78% without any external grading. Furthermore, our analysis shows that RHO effectively targets prior failure modes. As a result, the optimized harness alters the agent's behavior patterns and sustains higher accuracy during long-horizon sessions.
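The two label-free signals named above, self-consistency over parallel rollouts and pairwise self-preference over candidate updates, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the majority-vote definition of self-consistency, the round-robin win count, and the `prefer` judge (which in practice would be a call to the agent itself) are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Hashable, Sequence, Tuple


def self_consistency(answers: Sequence[Hashable]) -> Tuple[Hashable, float]:
    """Return the most common answer and the fraction of rollouts that agree with it.

    Assumption: each rollout is reduced to a hashable final answer.
    """
    if not answers:
        raise ValueError("need at least one rollout answer")
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


def select_by_pairwise_preference(
    candidates: Sequence[str],
    prefer: Callable[[str, str], int],
) -> str:
    """Return the candidate with the most wins in a round-robin comparison.

    `prefer(a, b)` is a hypothetical judge that returns 0 if `a` is preferred
    and 1 if `b` is preferred.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        winner = i if prefer(candidates[i], candidates[j]) == 0 else j
        wins[winner] += 1
    # max() keeps the first maximal index, so ties go to the earlier candidate.
    best = max(range(len(candidates)), key=wins.__getitem__)
    return candidates[best]


def toy_judge(a: str, b: str) -> int:
    """Stand-in judge for the demo only: prefers the longer description."""
    return 0 if len(a) >= len(b) else 1
```

With the toy judge, `select_by_pairwise_preference(["a", "abc", "ab"], toy_judge)` returns `"abc"`, and `self_consistency(["x", "x", "y", "x"])` returns `("x", 0.75)`. A round-robin needs a number of comparisons quadratic in the number of candidates, which is reasonable only when the candidate set is small.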