Large language models are increasingly deployed as autonomous coding agents and have achieved remarkably strong performance on software engineering benchmarks. However, it is unclear whether such success transfers to computational scientific workflows, where tasks require not only strong coding ability, but also the ability to navigate complex, domain-specific procedures and to interpret results in the context of scientific claims. To address this question, we present AutoMat, a benchmark for evaluating LLM-based agents' ability to reproduce claims from computational materials science. AutoMat poses three interrelated challenges: recovering underspecified computational procedures, navigating specialized toolchains, and determining whether the resulting evidence supports a claim. By working closely with subject matter experts, we curate a set of claims from real materials science papers to test whether coding agents can recover and execute the end-to-end workflow needed to support (or undermine) such claims. We then evaluate multiple representative coding agent settings across several foundation models. Our results show that current LLM-based agents obtain low overall success rates on AutoMat, with the best-performing setting achieving a success rate of only 54.1%. Error analysis further reveals that agents perform worst when workflows must be reconstructed from paper text alone and that they fail primarily due to incomplete procedures, methodological deviations, and execution fragility. Taken together, these findings position AutoMat as both a benchmark for computational scientific reproducibility and a tool for diagnosing the current limitations of agentic systems in AI-for-science settings.