ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

Rui Meng, Bhavana Dalvi Mishra, Jiefeng Chen, Chun-Liang Li, Palash Goyal, Mihir Parmar, Yiwen Song, Yale Song, Rajarishi Sinha, Parthasarathy Ranganathan, Burak Gokturk, Jinsung Yoon, Tomas Pfister

Source: arXiv. Project website: https://scientist-one.github.io/

Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures undetectable by surface-level evaluation: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source. Second, ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, CoE Audit, a post-hoc audit whose four integrity checks -- score verification, specification violation, reference verification, and method-code alignment -- apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.
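
The four CoE Audit checks named in the abstract can be pictured as a per-paper record that is aggregated into the kind of counts the abstract reports (for example 0/337 references or 14/15 papers). The following is a minimal illustrative sketch only, not the authors' implementation: the names `PaperAudit` and `summarize` and every field are hypothetical. The abstract reports different denominators for score verification (12/12) and method-code alignment (14/15) without explaining why, so the sketch simply assumes a check may be marked as not applicable for a given paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class PaperAudit:
    """Hypothetical audit record for one generated paper (all fields are assumptions)."""
    system: str
    refs_total: int                  # references cited in the paper
    refs_hallucinated: int           # references that could not be verified
    score_verified: Optional[bool]   # None = check not applicable (assumption)
    spec_violation: bool             # True if the task specification was violated
    method_code_aligned: bool        # True if the method text matches the code


def summarize(audits: List[PaperAudit]) -> Dict[str, Tuple[int, int]]:
    """Aggregate per-paper results into (count, denominator) pairs per check."""
    scored = [a for a in audits if a.score_verified is not None]
    return {
        "hallucinated_refs": (
            sum(a.refs_hallucinated for a in audits),
            sum(a.refs_total for a in audits),
        ),
        "score_verification": (sum(bool(a.score_verified) for a in scored), len(scored)),
        "spec_violations": (sum(a.spec_violation for a in audits), len(audits)),
        "method_code_alignment": (sum(a.method_code_aligned for a in audits), len(audits)),
    }


# Toy records for illustration only; these are not results from the paper.
demo = [
    PaperAudit("demo-system", 10, 0, True, False, True),
    PaperAudit("demo-system", 5, 1, None, False, False),
]
summary = summarize(demo)
```

Reporting each check as a (count, denominator) pair rather than a bare percentage keeps the denominators visible, which matters when a check does not apply to every paper.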
