Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

Jiajie Jin, Yuyang Hu, Kai Qiu, Qi Dai, Chong Luo, Guanting Dong, Xiaoxi Li, Tong Zhao, Xiaolong Ma, Gongrui Zhang, Zhirong Wu, Bei Liu, Zhengyuan Yang, Linjie Li, Lijuan Wang, Hongjin Qian, Yutao Zhu, Zhicheng Dou

Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. We evaluate Arbor under Autonomous Optimization (AO), an operational setting where an agent improves an initial research artifact through iterative experimentation without step-level human supervision. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor achieves the best held-out result on all six tasks, attaining more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in our comparison.
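To make the coordinator/executor split and the persistent tree more concrete, the following is a minimal Python sketch, not the authors' implementation. All names (`HypothesisNode`, `frontier`, `coordinate`, `toy_executor`) and the scores are hypothetical. It assumes each hypothesis yields one scalar held-out score, that lessons flow only from ancestors to descendants, and that "admitting" an improvement simply means beating the best score so far. The abstract specifies none of these details.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass(eq=False)
class HypothesisNode:
    """One node of a hypothetical hypothesis tree (illustrative only)."""
    statement: str
    parent: Optional["HypothesisNode"] = field(default=None, repr=False)
    children: List["HypothesisNode"] = field(default_factory=list)
    score: Optional[float] = None            # evidence; None until tested
    insights: List[str] = field(default_factory=list)

    def add_child(self, statement: str) -> "HypothesisNode":
        child = HypothesisNode(statement, parent=self)
        self.children.append(child)
        return child

    def inherited_insights(self) -> List[str]:
        """Lessons recorded on this node and its ancestors, root first."""
        chain, node = [], self
        while node is not None:
            chain.append(node)
            node = node.parent
        return [i for n in reversed(chain) for i in n.insights]


def frontier(root: HypothesisNode) -> List[HypothesisNode]:
    """Collect every node that has not been tested yet."""
    stack, pending = [root], []
    while stack:
        node = stack.pop()
        if node.score is None:
            pending.append(node)
        stack.extend(node.children)
    return pending


Executor = Callable[[str, List[str]], Tuple[float, str]]


def coordinate(root: HypothesisNode, executor: Executor,
               baseline: float, rounds: int) -> float:
    """Long-lived loop: dispatch a hypothesis, record evidence, admit gains."""
    best = baseline
    for _ in range(rounds):
        pending = frontier(root)
        if not pending:
            break
        node = pending[0]
        # A short-lived executor tests one hypothesis with inherited lessons.
        score, insight = executor(node.statement, node.inherited_insights())
        node.score = score
        node.insights.append(insight)
        if score > best:                     # assumed admission rule
            best = score
    return best


def toy_executor(statement: str, lessons: List[str]) -> Tuple[float, str]:
    """Stand-in for a real experiment; the scores are made up."""
    scores = {"larger batch": 0.62, "lower learning rate": 0.48}
    return scores[statement], f"{statement} -> {scores[statement]}"


root = HypothesisNode("initial artifact", score=0.50)
root.add_child("larger batch")
root.add_child("lower learning rate")
best = coordinate(root, toy_executor, baseline=0.50, rounds=5)
print(best)
```

In this sketch the tree is the only state that persists between rounds, while each executor call is stateless apart from the lessons it is handed. That mirrors the long-lived versus short-lived split described above, under the stated simplifications.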
