Reproducibility is a core requirement in evolutionary computation, where results depend largely on computational experiments. In practice, reproducibility hinges on how algorithms, experimental protocols, and artifacts are documented and shared. Despite growing awareness, there is still limited empirical evidence on the actual level of reproducibility of published work in the field. In this paper, we study reproducibility practices in papers published in the Evolutionary Combinatorial Optimization and Metaheuristics track of the Genetic and Evolutionary Computation Conference over a ten-year period. We introduce a structured reproducibility checklist and apply it through a systematic manual assessment of the selected corpus. In addition, we propose RECAP (REproducibility Checklist Automation Pipeline), an LLM-based system that automatically evaluates reproducibility signals from paper text and associated code repositories. Our analysis shows that papers achieve an average completeness score of 0.62 and that 36.90% of them provide additional material beyond the manuscript itself. We demonstrate that automated assessment is feasible: RECAP achieves substantial agreement with human evaluators (Cohen's κ of 0.67). Together, these results highlight persistent gaps in reproducibility reporting and suggest that automated tools can effectively support large-scale, systematic monitoring of reproducibility practices.
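For readers unfamiliar with the agreement metric, the following minimal Python sketch illustrates how Cohen's κ between a human evaluator and an automated assessor could be computed; the binary labels and the use of scikit-learn here are illustrative assumptions, not the paper's actual evaluation code.

```python
# Hypothetical sketch: human-vs-RECAP agreement via Cohen's kappa.
# The checklist judgments below are made-up example data.
from sklearn.metrics import cohen_kappa_score

# Per-checklist-item judgments (1 = criterion satisfied, 0 = not satisfied)
human = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
recap = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(human, recap)
print(f"Cohen's kappa: {kappa:.2f}")
# By the common Landis-Koch convention, 0.61-0.80 reads as "substantial agreement".
```

Unlike raw percent agreement, κ discounts the agreement expected by chance, which is why it is the standard choice for comparing an automated annotator against a human rater.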