TEASMA: A Practical Approach for the Test Assessment of Deep Neural Networks using Mutation Analysis

Successful deployment of Deep Neural Networks (DNNs), particularly in safety-critical systems, requires their validation with an adequate test set to ensure a sufficient degree of confidence in test outcomes. Mutation analysis, a well-established technique for measuring test adequacy in traditional software, has been adapted to DNNs in recent years. This technique is based on generating mutants that ideally are representative of actual faults and thus can be used for test adequacy assessment. In this paper, we investigate for the first time whether and how mutation operators that directly modify the trained DNN model (i.e., post-training operators) can be used for reliably assessing the test inputs of DNNs. Our results show that these operators, though they do not aim to represent realistic faults, exhibit strong, non-linear relationships with faults. Inspired by this finding and considering the significant computational advantage of post-training operators over operators that modify the training data or program (i.e., pre-training operators), we propose and evaluate TEASMA, an approach based on post-training mutation for assessing the adequacy of DNN test sets. In practice, TEASMA allows engineers to decide whether they can trust test results and thus validate the DNN before its deployment. Based on a DNN model's training set, TEASMA provides a methodology to build accurate DNN-specific prediction models of the Fault Detection Rate (FDR) of a test set from its mutation score, thus enabling its assessment. Our large empirical evaluation, across multiple DNN models, shows that predicted FDR values have a strong linear correlation (R² ≥ 0.94) with actual values. Consequently, empirical evidence suggests that TEASMA provides a reliable basis for confidently deciding whether to trust test results or improve the test set of a DNN model.
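The core idea of predicting FDR from a mutation score can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the input layout (a mapping from each input index to the set of mutant or fault identifiers it kills or reveals, assumed to be precomputed), and the choice of a quadratic least-squares fit are all assumptions made here for illustration. The abstract states only that the relationship is non-linear and that the prediction model is built per DNN from subsets of its training set.

```python
def coverage_rate(hits_by_input, subset, total):
    """Fraction of `total` items (mutants or faults) hit by at least one input.

    hits_by_input: hypothetical mapping input index -> set of item ids
    (mutants killed, or faults revealed, by that input).
    Used with mutants this gives a mutation score; with faults, an FDR.
    """
    hit = set()
    for i in subset:
        hit |= hits_by_input[i]
    return len(hit) / total


def fit_quadratic(xs, ys):
    """Least-squares fit of y ~ c0 + c1*x + c2*x^2 (assumed model form)."""
    # Normal equations A c = b for the three polynomial coefficients.
    A = [[sum(x ** (i + j) for x in xs) for j in range(3)] for i in range(3)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    coeffs = [0.0, 0.0, 0.0]
    for i in reversed(range(3)):
        s = sum(A[i][j] * coeffs[j] for j in range(i + 1, 3))
        coeffs[i] = (b[i] - s) / A[i][i]
    return coeffs


def predict_fdr(coeffs, mutation_score):
    """Predict FDR from a mutation score, clamped to the valid range [0, 1]."""
    c0, c1, c2 = coeffs
    value = c0 + c1 * mutation_score + c2 * mutation_score ** 2
    return min(1.0, max(0.0, value))
```

Under these assumptions, one would sample many input subsets from the training set, compute a (mutation score, FDR) pair for each with `coverage_rate`, fit the model with `fit_quadratic`, and then call `predict_fdr` on the mutation score of the actual test set.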
