TEASMA: A Practical Methodology for Test Adequacy Assessment of Deep Neural Networks

Successful deployment of Deep Neural Networks (DNNs) requires validating them with an adequate test set, so that test outcomes can be trusted with sufficient confidence. Although well-established test adequacy assessment techniques have been proposed for DNNs, their application within a comprehensive methodology for accurately predicting the fault detection ability of test sets, and thus assessing their adequacy, still needs to be investigated. In this paper, we propose and evaluate TEASMA, a comprehensive and practical methodology designed to accurately assess the adequacy of test sets for DNNs. In practice, TEASMA allows engineers to decide whether they can trust high-accuracy test results and thus validate the DNN before its deployment. Based on a DNN model's training set, TEASMA provides a procedure to build accurate DNN-specific prediction models of the Fault Detection Rate (FDR) of a test set from an existing adequacy metric, thus enabling the assessment of that test set. We evaluated TEASMA with four state-of-the-art test adequacy metrics: Distance-based Surprise Coverage (DSC), Likelihood-based Surprise Coverage (LSC), Input Distribution Coverage (IDC), and Mutation Score (MS). Our extensive empirical evaluation across multiple DNN models and input sets, including ImageNet, reveals a strong linear correlation between the predicted and actual FDR values derived from MS, DSC, and IDC, with minimum R^2 values of 0.94 for MS and 0.90 for DSC and IDC. Furthermore, when relying on regression analysis and MS, the average Root Mean Square Error (RMSE) between actual and predicted FDR values across all subjects is low, at 0.09 (9%), which demonstrates the superior accuracy of MS compared to DSC and IDC, whose RMSE values are 0.17 and 0.18, respectively. Overall, these results suggest that TEASMA provides a reliable basis for confidently deciding whether to trust test results for DNN models.
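To make the evaluation measures concrete, the following is a minimal sketch of how a prediction model of FDR could be fitted from an adequacy metric and then scored with R^2 and RMSE. It assumes a simple least-squares line; the abstract only says "regression analysis", so the regression form actually used may differ. The function names and the sample values are hypothetical and are not taken from the evaluation.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(actual, predicted):
    """Coefficient of determination between actual and predicted values."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# Hypothetical (adequacy metric, FDR) pairs, e.g. measured on subsets
# sampled from the training set. These numbers are made up for illustration.
train_metric = [0.10, 0.20, 0.40, 0.60, 0.80]
train_fdr = [0.15, 0.22, 0.45, 0.58, 0.83]

slope, intercept = fit_line(train_metric, train_fdr)

# Hypothetical held-out test subsets with known FDR, used to score the model.
test_metric = [0.30, 0.50, 0.70]
test_fdr = [0.33, 0.52, 0.70]
predicted_fdr = [slope * m + intercept for m in test_metric]

print("R^2 :", r_squared(test_fdr, predicted_fdr))
print("RMSE:", rmse(test_fdr, predicted_fdr))
```

Because FDR lies between 0 and 1, an RMSE of 0.09 corresponds to an average prediction error of about 9 percentage points on that scale.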
