Artifact evaluation has become standard practice in the software engineering community to ensure the reproducibility of research results. However, the current manual process is labor-intensive and hence performed only as a one-time assessment for a subset of all papers. To support the artifact evaluation effort, we present Artisan, an automated LLM agent for reproducing research results given a paper and its artifact. The approach is enabled by two key contributions. First, we frame the reproduction problem as a code generation task, where the goal is to generate a reproduction script that, when executed, reproduces the results reported in a paper. Unlike prior work on automatically reproducing research results in other domains, this formulation allows for running the script independently of the agent and for assessing the reproduction process at a fine-grained level. Second, we design an automated judging mechanism that guides the agent toward the expected results without revealing them and that prevents trivial solutions, such as simply copying checked-in results. To evaluate Artisan, we introduce Artisan-Bench, the first benchmark for assessing the ability to generate reproduction scripts and the first benchmark for automated artifact evaluation in software engineering. Artisan-Bench comprises 60 tasks derived from 23 software engineering papers, covering different research areas and programming languages. We validate all tasks in Artisan-Bench for reproducibility to ensure that they are feasible. Our experiments show that Artisan is effective: it produces 44/60 reproduction scripts and outperforms the best available baseline, a vanilla LLM agent (mini-swe-agent), by 3.14$\times$ in terms of reproduction scripts generated, while costing $0.45 and taking 48 minutes per task on average. Artisan also helped uncover 20 new errors in either the paper or its artifact.