Studying the Practices of Testing Machine Learning Software in the Wild

Background: We are witnessing an increasing adoption of machine learning (ML), especially deep learning (DL) algorithms, in many software systems, including safety-critical systems such as health care systems or autonomous driving vehicles. Ensuring the software quality of these systems is still an open challenge for the research community, mainly due to the inductive nature of ML software systems. Traditionally, software systems were constructed deductively, by writing down the rules that govern the behavior of the system as program code. For ML software, however, these rules are inferred from training data. A few recent research advances in the quality assurance of ML systems have adapted concepts from traditional software testing, such as mutation testing, to help improve the reliability of ML software systems. However, it is unclear whether any of these proposed testing techniques are adopted in practice, and there is little empirical evidence about the testing strategies of ML engineers. Aims: To fill this gap, we perform the first fine-grained empirical study on ML testing practices in the wild, to identify the ML properties being tested, the testing strategies followed, and their implementation throughout the ML workflow. Method: First, we systematically summarized the different testing strategies (e.g., Oracle Approximation), the tested ML properties (e.g., Correctness, Bias, and Fairness), and the testing methods (e.g., Unit test) from the literature. Then, we conducted a study to understand the practices of testing ML software. Results: 1) We identified four major categories of testing strategy, namely Grey-box, White-box, Black-box, and Heuristic-based techniques, that ML engineers use to find software bugs. 2) We identified 16 ML properties that are tested in the ML workflow.
