Bug Characterization in Machine Learning-based Systems

The rapid growth of Machine Learning (ML) applications in different domains, especially in safety-critical areas, increases the need for reliable ML components, i.e., software components that operate based on ML. Understanding bug characteristics and maintenance challenges in ML-based systems can help developers of these systems identify where to focus maintenance and testing efforts, by giving insights into the most error-prone components, the most common bugs, etc. In this paper, we investigate the characteristics of bugs in ML-based software systems and the differences between ML and non-ML bugs from the maintenance viewpoint. We extracted 447,948 GitHub repositories that used one of the three most popular ML frameworks, i.e., TensorFlow, Keras, and PyTorch. After multiple filtering steps, we selected the top 300 repositories with the highest number of closed issues. We manually investigated the extracted repositories to exclude non-ML-based systems. Our investigation involved a manual inspection of 386 sampled reported issues in the identified ML-based systems to determine whether they affect ML components or not. Our analysis shows that nearly half of the real issues reported in ML-based systems are ML bugs, indicating that ML components are more error-prone than non-ML components. Next, we thoroughly examined 109 identified ML bugs to identify their root causes and symptoms and to calculate the time required to fix them. The results also revealed that ML bugs have significantly different characteristics compared to non-ML bugs in terms of the complexity of bug-fixing (number of commits, changed files, and changed lines of code). Based on our results, fixing ML bugs is more costly than fixing non-ML bugs, and ML components are more error-prone than non-ML components. Hence, paying significant attention to the reliability of ML components is crucial in ML-based systems.
