Towards Enhancing the Reproducibility of Deep Learning Bugs: An Empirical Study

Context: Deep learning has achieved remarkable progress in various domains. However, like any software system, deep learning systems contain bugs, some of which can have severe impacts, as evidenced by crashes involving autonomous vehicles. Despite substantial advancements in deep learning techniques, little research has focused on reproducing deep learning bugs, which is an essential step toward resolving them. Existing literature suggests that only 3% of deep learning bugs are reproducible, underscoring the need for further research. Objective: This paper examines the reproducibility of deep learning bugs. We identify the edit actions and useful information that could improve their reproducibility. Method: First, we construct a dataset of 668 deep learning bugs from Stack Overflow and GitHub across three frameworks and 22 architectures. Second, we select 165 of the 668 bugs using stratified sampling and attempt to reproduce them. While reproducing these bugs, we record the edit actions and useful information that their reproduction requires. Third, we use the Apriori algorithm to identify the useful information and edit actions required to reproduce specific types of bugs. Finally, we conduct a user study involving 22 developers to assess the effectiveness of our findings in real-life settings. Results: We successfully reproduced 148 of the 165 bugs attempted. We identified ten edit actions and five useful types of component information that can help reproduce deep learning bugs. With the help of our findings, the developers in the user study were able to reproduce 22.92% more bugs and reduce their reproduction time by 24.35%. Conclusions: Our research addresses the critical issue of deep learning bug reproducibility. Practitioners and researchers can leverage our findings to improve deep learning bug reproducibility.
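As a rough illustration of the third step of the method, the sketch below shows how Apriori-style frequent itemset mining could surface edit actions and information types that co-occur with a bug type. It is a minimal pure-Python sketch, not the study's implementation: the `apriori` function, the `reports` data, the item labels (such as `bug:training` or `edit:input_data_generation`), and the `min_support` threshold are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset whose support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical bug reports: each set holds a bug type, an edit action used to
# reproduce it, and a type of useful information. These are made-up examples.
reports = [
    {"bug:training", "edit:input_data_generation", "info:data"},
    {"bug:training", "edit:input_data_generation", "info:model"},
    {"bug:model", "edit:import_addition", "info:model"},
    {"bug:training", "edit:input_data_generation", "info:data"},
    {"bug:api", "edit:import_addition", "info:code_snippet"},
]

frequent = apriori(reports, min_support=0.6)

# Confidence of the rule "bug:training -> edit:input_data_generation".
pair = frozenset({"bug:training", "edit:input_data_generation"})
confidence = frequent[pair] / frequent[frozenset({"bug:training"})]
```

In this toy data, the pair of a training bug and input data generation clears the support threshold, and the rule's confidence is the pair's support divided by the support of the bug type alone. A real analysis would depend on how bug types, edit actions, and information types are encoded, which the abstract does not specify.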
