The Last Human-Written Paper: Agent-Native Research Artifacts

Jiachen Liu, Jiaxin Pei, Jintao Huang, Chenglei Si, Ao Qu, Xiangru Tang, Runyu Lu, Lichang Chen, Xiaoyan Bai, Haizhong Zheng, Carl Chen, Zhiyang Chen, Haojie Ye, Yujuan Fu, Zexue He, Zijian Jin, Zhenyu Zhang, Shangquan Sun, Maestro Harmon, John Dianzhuo Wang, Jianqiao Zeng, Jiachen Sun, Mingyuan Wu, Baoyu Zhou, Chenyu You, Shijian Lu, Yiming Qiu, Fan Lai, Yuan Yuan, Yao Li, Junyuan Hong, Ruihao Zhu, Beidi Chen, Alex Pentland, Ang Chen, Mosharaf Chowdhury, Zechen Zhang

From arXiv: 46 pages, 15 figures, 14 tables

Scientific publication compresses a branching, iterative research process into a linear narrative, discarding most of what was discovered along the way. This compilation imposes two structural costs: a Storytelling Tax, where failed experiments, rejected hypotheses, and the exploration that connected them are cut to fit the narrative; and an Engineering Tax, where the gap between reviewer-sufficient prose and agent-sufficient specification leaves critical implementation details unwritten. These costs are tolerable for human readers but become critical when AI agents must understand, reproduce, and extend published work. We introduce the Agent-Native Research Artifact (ARA), a protocol that replaces the narrative paper with a machine-executable research package structured around four layers: scientific logic, executable code with full specifications, an exploration graph that preserves the failures that compilation discards, and evidence that grounds every claim in raw outputs. Three mechanisms support the ecosystem: a Live Research Manager that captures decisions and dead ends during ordinary development; an ARA Compiler that translates legacy PDFs and repos into ARAs; and an ARA-native review system that automates objective checks so human reviewers can focus on significance, novelty, and taste. On PaperBench and RE-Bench, ARA raises question-answering accuracy from 72.4% to 93.7% and reproduction success from 57.4% to 64.4%. On RE-Bench's five open-ended extension tasks, the failure traces preserved in ARA accelerate progress but, depending on the agent's capabilities, can also keep a capable agent from stepping outside the box defined by prior runs. Our code is open-sourced at https://github.com/Orchestra-Research/Agent-Native-Research-Artifact.
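
To make the four-layer structure concrete, here is a minimal Python sketch of how such a package might be modeled. It is an illustration only, not the ARA schema from the paper or its repository: every class, field, outcome label, file path, and command below is a hypothetical name chosen for this example. The two helper methods show the kind of objective check a review system could automate (claims with no raw evidence) and the kind of information a linear narrative would drop (dead ends in the exploration graph).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Claim:
    """One statement from the scientific-logic layer (hypothetical shape)."""
    claim_id: str
    statement: str


@dataclass
class ExplorationNode:
    """One step in the exploration graph, including failed attempts."""
    node_id: str
    summary: str
    outcome: str                      # assumed labels: "kept", "failed", "rejected"
    parent_id: Optional[str] = None   # None marks a root of the graph


@dataclass
class ResearchArtifact:
    """Illustrative four-layer package; not the actual ARA format."""
    claims: List[Claim] = field(default_factory=list)                  # scientific logic
    run_commands: Dict[str, str] = field(default_factory=dict)         # executable code and specs
    exploration: List[ExplorationNode] = field(default_factory=list)   # exploration graph
    evidence: Dict[str, List[str]] = field(default_factory=dict)       # claim_id -> raw output paths

    def ungrounded_claims(self) -> List[str]:
        """Return IDs of claims that have no raw output attached."""
        return [c.claim_id for c in self.claims if not self.evidence.get(c.claim_id)]

    def dead_ends(self) -> List[ExplorationNode]:
        """Return every exploration step that did not make it into the final result."""
        return [n for n in self.exploration if n.outcome != "kept"]


# Made-up example data, for illustration only.
artifact = ResearchArtifact(
    claims=[
        Claim("c1", "Method A beats the baseline"),
        Claim("c2", "Method A scales linearly"),
    ],
    run_commands={"train": "python train.py --config configs/a.yaml"},
    exploration=[
        ExplorationNode("n0", "baseline", "kept"),
        ExplorationNode("n1", "larger learning rate diverged", "failed", parent_id="n0"),
        ExplorationNode("n2", "method A", "kept", parent_id="n0"),
    ],
    evidence={"c1": ["logs/run_017/metrics.json"]},
)

print(artifact.ungrounded_claims())               # ['c2']
print([n.node_id for n in artifact.dead_ends()])  # ['n1']
```

In this sketch, evidence is keyed by claim ID so that a missing entry is detectable mechanically, and failed nodes keep a parent pointer so that an agent can see which branch a dead end came from.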
