Artifact evaluation has been adopted by the Software Engineering (SE) research community for 15 years, substantially improving research reproducibility across major SE conferences. However, this success has introduced a growing scalability challenge: artifact evaluation relies heavily on reviewers manually executing and debugging artifacts, so the required human effort keeps escalating as paper submissions rapidly increase. To address this problem, we investigate automated artifact evaluation. We first conduct a preliminary study of artifacts from top-tier SE conferences and identify three key challenges: perceiving execution states, maintaining stable execution environments, and recovering from execution errors. Inspired by these findings, we propose ArtifactCopilot, the first end-to-end agent-based framework for automated artifact evaluation. ArtifactCopilot automates environment construction, instruction execution, and error recovery by combining two components: an execution normalization strategy that keeps the execution environment stable, and an artifact evaluation graph that transforms README documents into dependency-aware command graphs, enabling structured execution planning, execution-state tracking, and error recovery. An evaluation on 48 real-world artifacts shows that ArtifactCopilot matches human artifact evaluation outcomes on 85.42% of the artifacts, outperforming Claude Code by 52.09 percentage points, while costing only \$0.091 per artifact on average and requiring no human intervention for 45 of the 48 artifacts.