Scientific publication compresses a branching, iterative research process into a linear narrative, discarding the majority of what was discovered along the way. This compression imposes two structural costs: a Storytelling Tax, where failed experiments, rejected hypotheses, and the branching exploration process are discarded to fit a linear narrative; and an Engineering Tax, where the gap between reviewer-sufficient prose and agent-sufficient specification leaves critical implementation details unwritten. These costs are tolerable for human readers, but they become critical when AI agents must understand, reproduce, and extend published work. We introduce the Agent-Native Research Artifact (Ara), a protocol that replaces the narrative paper with a machine-executable research package structured around four layers: scientific logic, executable code with full specifications, an exploration graph that preserves the failures compression discards, and evidence grounding every claim in raw outputs. Three mechanisms support the ecosystem: a Live Research Manager that captures decisions and dead ends during ordinary development; an Ara Compiler that translates legacy PDFs and repos into Aras; and an Ara-native review system that automates objective checks so human reviewers can focus on significance, novelty, and taste. On PaperBench and RE-Bench, Ara raises question-answering accuracy from 72.4% to 93.7% and reproduction success from 57.4% to 64.4%. On RE-Bench's five open-ended extension tasks, the failure traces preserved in an Ara accelerate progress, though depending on the agent's capabilities they can also anchor it to prior runs and discourage exploration beyond them.
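The four layers described above suggest a concrete package shape. The sketch below is a minimal, hypothetical rendering of that structure in Python; all class and field names are illustrative assumptions, not the Ara protocol's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical sketch of the four-layer Ara package; names are
# illustrative assumptions, not the protocol's real schema.

@dataclass
class Claim:
    statement: str              # a scientific claim made by the work
    evidence_paths: list        # raw outputs (logs, metrics) grounding it (layer 4)

@dataclass
class ExplorationNode:
    hypothesis: str             # what was tried on this branch
    outcome: str                # e.g. "success", "failed", "abandoned"
    parent: Optional[str] = None  # id of the branch this node forked from

@dataclass
class AraPackage:
    scientific_logic: list      # layer 1: claims and the reasoning behind them
    code_entrypoints: dict      # layer 2: fully specified, runnable commands
    exploration_graph: dict     # layer 3: id -> ExplorationNode, incl. dead ends
    evidence: dict = field(default_factory=dict)  # layer 4: claim id -> raw output path

# Illustrative construction: a package with one claim, one entrypoint,
# and an exploration graph that keeps a failed branch instead of discarding it.
pkg = AraPackage(
    scientific_logic=[Claim("method X improves metric Y", ["runs/exp1/metrics.json"])],
    code_entrypoints={"train": "python train.py --config configs/main.yaml --seed 0"},
    exploration_graph={
        "n0": ExplorationNode("baseline architecture", "success"),
        "n1": ExplorationNode("larger batch size", "failed", parent="n0"),
    },
)
```

Keeping failed branches (like `n1` above) as first-class nodes, rather than deleting them during write-up, is what lets a downstream agent see which directions were already ruled out.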