Replication packages are crucial for enabling transparency, validation, and reuse in software engineering (SE) research. While artifact sharing is now standard practice and even expected at premier SE venues such as ICSE, the practical usability of these replication packages remains underexplored. In particular, there is a marked lack of studies that comprehensively examine the executability and reproducibility of replication packages in SE research. In this paper, we aim to fill this gap by evaluating 100 replication packages published as part of ICSE proceedings over the past decade (2015--2024). We assess (1) the executability of the replication packages, (2) the effort and modifications required to execute them, (3) the challenges that prevent executability, and (4) the reproducibility of the original findings. We spent approximately 650 person-hours in total executing the artifacts and reproducing the study findings. Our findings reveal that only 40\% of the 100 evaluated artifacts were executable, of which 32.5\% (13 out of 40) ran without any modification. Regarding effort levels, 17.5\% (7 out of 40) required low effort, while 82.5\% (33 out of 40) required moderate to high effort to execute successfully. We identified five common types of modifications and 13 challenges leading to execution failure, spanning environmental, documentation, and structural issues. Among the executable artifacts, only 35\% (14 out of 40) reproduced the original results. These findings highlight a notable gap between artifact availability, executability, and reproducibility. Our study proposes three actionable guidelines to improve the preparation, documentation, and review of research artifacts, thereby strengthening the rigor and sustainability of open science practices in SE research.