Towards Enhancing the Reproducibility of Deep Learning Bugs: An Empirical Study

Context: Deep learning has achieved remarkable progress in various domains. However, like any software system, deep learning systems contain bugs, some of which can have severe impacts, as evidenced by crashes involving autonomous vehicles. Despite substantial advancements in deep learning techniques, little research has focused on reproducing deep learning bugs, an essential step toward resolving them. Existing literature suggests that only 3% of deep learning bugs are reproducible, underscoring the need for further research. Objective: This paper examines the reproducibility of deep learning bugs. We identify edit actions and useful information that could improve the reproducibility of deep learning bugs. Method: First, we construct a dataset of 668 deep learning bugs from Stack Overflow and GitHub across three frameworks and 22 architectures. Second, we select 165 of the 668 bugs using stratified sampling and attempt to determine their reproducibility. While reproducing these bugs, we identify the edit actions and useful information needed for their reproduction. Third, we use the Apriori algorithm to identify the useful information and edit actions required to reproduce specific types of bugs. Finally, we conduct a user study involving 22 developers to assess the effectiveness of our findings in real-life settings. Results: We successfully reproduced 148 of the 165 bugs attempted. We identified ten edit actions and five useful types of component information that can help reproduce deep learning bugs. Using our findings, the developers in the user study reproduced 22.92% more bugs and reduced their reproduction time by 24.35%. Conclusions: Our research addresses the critical issue of deep learning bug reproducibility. Practitioners and researchers can leverage our findings to improve deep learning bug reproducibility.
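The third step of the method, mining associations between bug types and the edit actions or information needed to reproduce them, can be illustrated with a minimal Apriori sketch. This is not the study's implementation. The labels (`bug_type:T1`, `edit:E1`, `info:I1`), the toy records, and the thresholds are hypothetical placeholders, and the function names `apriori` and `association_rules` are chosen here for illustration only.

```python
from itertools import combinations

# Hypothetical toy data: each record is one reproduced bug, described by made-up
# labels for its type, the edit actions applied, and the information used.
BUGS = [
    {"bug_type:T1", "edit:E1", "info:I1"},
    {"bug_type:T1", "edit:E1", "info:I2"},
    {"bug_type:T1", "edit:E1"},
    {"bug_type:T2", "edit:E2", "info:I1"},
    {"bug_type:T2", "edit:E2"},
    {"bug_type:T1", "edit:E2"},
]


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: frequent single items.
    current = keep_frequent({frozenset([i]) for t in transactions for i in t})
    frequent = dict(current)
    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        current = keep_frequent(candidates)
        frequent.update(current)
        k += 1
    return frequent


def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) for rules above the threshold."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                a = frozenset(antecedent)
                # Subsets of a frequent itemset are frequent, so frequent[a] exists.
                confidence = sup / frequent[a]
                if confidence >= min_confidence:
                    rules.append((a, itemset - a, confidence))
    return rules


if __name__ == "__main__":
    found = apriori(BUGS, min_support=0.3)
    for lhs, rhs, conf in association_rules(found, min_confidence=0.7):
        print(sorted(lhs), "->", sorted(rhs), f"confidence={conf:.2f}")
```

In this toy data, three of the four `bug_type:T1` records also contain `edit:E1`, so a rule from that bug type to that edit action has confidence 0.75. Rules of this shape are the kind of output that could link a bug type to the edits likely needed to reproduce it.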
