Towards Enhancing the Reproducibility of Deep Learning Bugs: An Empirical Study

Context: Deep learning has achieved remarkable progress in various domains. However, like traditional software systems, deep learning systems contain bugs, which can have severe impacts, as evidenced by crashes involving autonomous vehicles. Despite substantial advancements in deep learning techniques, little research has focused on reproducing deep learning bugs, which hinders their resolution. Existing literature suggests that only 3% of deep learning bugs are reproducible, underscoring the need for further research.

Objective: This paper examines the reproducibility of deep learning bugs. We identify edit actions and useful information that could improve deep learning bug reproducibility.

Method: First, we construct a dataset of 668 deep learning bugs from Stack Overflow and Defects4ML across 3 frameworks and 22 architectures. Second, we select 102 bugs using stratified sampling and attempt to reproduce them. While reproducing these bugs, we identify the edit actions and useful information necessary for their reproduction. Third, we use the Apriori algorithm to identify the useful information and edit actions required to reproduce specific bug types. Finally, we conduct a user study with 22 developers to assess the effectiveness of our findings in real-life settings.

Results: We successfully reproduced 85 bugs and identified ten edit actions and five categories of useful information that can help reproduce deep learning bugs. In our user study, our findings improved bug reproducibility by 22.92% and reduced reproduction time by 24.35%.

Conclusions: Our research addresses the critical issue of deep learning bug reproducibility. Practitioners and researchers can leverage our findings to improve deep learning bug reproducibility.
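
To illustrate the third step of the method, the following is a minimal sketch of Apriori-style frequent itemset mining, written in plain Python. It is not the study's implementation. The bug types, edit actions, and information labels in `reports` are hypothetical placeholders, and the `min_support` threshold of 0.5 is an assumption chosen only to keep the example small.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical data: one set per reproduced bug, mixing its bug type with the
# edit actions and information that were needed to reproduce it.
reports = [
    {"type:training", "edit:add_imports", "info:data"},
    {"type:training", "edit:add_imports", "info:hyperparameters"},
    {"type:model", "edit:add_imports"},
    {"type:training", "info:data"},
]

frequent = apriori(reports, min_support=0.5)
for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

Itemsets that pair a bug type with an edit action or an information category, such as the hypothetical `type:training` with `edit:add_imports`, are the kind of co-occurrence that this step is meant to surface. Turning them into association rules would additionally require a confidence threshold, which is omitted here.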
