Artifact evaluation has been adopted in the Software Engineering (SE) research community for 15 years, substantially improving research reproducibility across major SE conferences. However, this success has introduced a growing scalability challenge, as artifact evaluation relies heavily on reviewers' manual execution and debugging, leading to escalating human effort amid rapidly increasing paper submissions. To address this problem, we investigate automated artifact evaluation. We first conduct a preliminary study on artifacts from top-tier SE conferences and identify three key challenges: perceiving execution states, maintaining stable execution environments, and recovering from execution errors. Inspired by these findings, we propose ArtifactCopilot, the first end-to-end agent-based framework for automated artifact evaluation. ArtifactCopilot automates environment construction, instruction execution, and error recovery by combining an execution normalization strategy to ensure environment stability with an artifact evaluation graph that transforms README documents into dependency-aware command graphs, enabling structured execution planning, execution-state tracking, and error recovery. Evaluation on 48 real-world artifacts shows that ArtifactCopilot matches human artifact evaluation outcomes for 85.42% of the artifacts, outperforming Claude Code by 52.09 percentage points, while costing only \$0.091 per artifact on average and requiring zero human intervention for 45 out of 48 artifacts.