Reproducibility in empirical software engineering relies on complete, accessible, and reusable research artifacts, yet artifact evaluation remains largely manual and difficult to scale. This emerging results paper explores an agentic approach to assessing replication package quality by translating open-science guidelines into machine-verifiable criteria. We consolidate 380 requirements from 34 sources into 51 reproducibility criteria, of which 31 are operationalized for automated artifact-based evaluation. Based on these criteria, we implement a multi-agent prototype that automatically inspects replication packages and produces evidence-grounded improvement reports. A preliminary evaluation on five replication packages shows high inter-run consistency of 91.4% and 75.4% correctness, measured as micro-averaged agreement with a manual baseline. The agent performs best on structural criteria such as code, environment, and artifact availability, but struggles with qualitative or mixed-method studies. A pilot survey with seven software engineering researchers indicates high perceived usefulness and adoption potential, while revealing cognitive load in the human-in-the-loop planning step. Overall, these emerging results indicate that agentic research artifact evaluation has the potential to support authors and reviewers by automating selected routine checks.
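The correctness figure above is reported as micro-averaged agreement with a manual baseline. The following is a minimal sketch of how such a score could be computed, not the authors' implementation. The verdict labels, criterion names, package names, and the choice to count a missing agent verdict as a disagreement are all assumptions made for illustration. A macro-averaged variant is included only for contrast.

```python
from typing import Dict

# Hypothetical layout: {package name: {criterion name: verdict label}}.
Verdicts = Dict[str, Dict[str, str]]


def micro_agreement(agent: Verdicts, manual: Verdicts) -> float:
    """Pool every (package, criterion) pair and return the share that match.

    Assumption: a criterion the agent did not judge counts as a mismatch.
    """
    matches = 0
    total = 0
    for package, manual_verdicts in manual.items():
        agent_verdicts = agent.get(package, {})
        for criterion, expected in manual_verdicts.items():
            total += 1
            if agent_verdicts.get(criterion) == expected:
                matches += 1
    if total == 0:
        raise ValueError("manual baseline contains no criteria")
    return matches / total


def macro_agreement(agent: Verdicts, manual: Verdicts) -> float:
    """Average the per-package agreement rates, weighting packages equally."""
    rates = []
    for package, manual_verdicts in manual.items():
        if not manual_verdicts:
            continue  # skip packages with nothing to compare
        agent_verdicts = agent.get(package, {})
        hits = sum(
            agent_verdicts.get(criterion) == expected
            for criterion, expected in manual_verdicts.items()
        )
        rates.append(hits / len(manual_verdicts))
    if not rates:
        raise ValueError("manual baseline contains no criteria")
    return sum(rates) / len(rates)


# Made-up example data (not taken from the paper).
manual_baseline: Verdicts = {
    "pkg_a": {
        "code_available": "met",
        "env_documented": "met",
        "license_present": "unmet",
        "data_available": "met",
    },
    "pkg_b": {"code_available": "met"},
}
agent_run: Verdicts = {
    "pkg_a": {
        "code_available": "met",
        "env_documented": "met",
        "license_present": "met",  # disagrees with the manual verdict
        "data_available": "met",
    },
    "pkg_b": {"code_available": "unmet"},  # disagrees with the manual verdict
}

micro = micro_agreement(agent_run, manual_baseline)
macro = macro_agreement(agent_run, manual_baseline)
print(f"micro-averaged agreement: {micro:.3f}")
print(f"macro-averaged agreement: {macro:.3f}")
```

Micro-averaging weights each criterion judgment equally, so packages with more applicable criteria contribute more to the score. Macro-averaging would instead give each of the five packages the same weight regardless of how many criteria apply to it.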