AI agents powered by large language models exhibit strong reasoning and problem-solving capabilities, enabling them to assist with scientific research tasks such as formula derivation and code generation. However, whether these agents can reliably perform end-to-end reproduction of results from real scientific papers remains an open question. We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics. Each task requires an agent to comprehend the methodology of a published paper, implement the corresponding algorithms from scratch, and produce quantitative results matching the original publication. Agents are provided only with the task instruction and the paper content, and operate in a sandboxed execution environment. All tasks are contributed by domain experts from over 20 research groups at the School of Physics, Peking University; each is grounded in a real published paper and validated through end-to-end reproduction, with verified ground-truth results and detailed scoring rubrics. Using an agentified assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution. The best-performing agent, OpenAI Codex powered by GPT-5.3-Codex, achieves a mean overall score of 34%. All agents exhibit a zero end-to-end callback success rate, with particularly poor performance in data accuracy and code correctness. We further identify systematic failure modes, including errors in formula implementation, inability to debug numerical simulations, and fabrication of output data. Overall, PRBench provides a rigorous benchmark for evaluating progress toward autonomous scientific research.