Successful deployment of Deep Neural Networks (DNNs) requires validating them with an adequate test set to ensure sufficient confidence in test outcomes. Although well-established test adequacy assessment techniques have been proposed for DNNs, their application within a comprehensive methodology for accurately predicting the fault detection ability of test sets, and thus assessing their adequacy, remains to be investigated. In this paper, we propose and evaluate TEASMA, a comprehensive and practical methodology designed to accurately assess the adequacy of test sets for DNNs. In practice, TEASMA allows engineers to decide whether they can trust high-accuracy test results and thus validate the DNN before its deployment. Based on a DNN model's training set, TEASMA provides a procedure to build accurate DNN-specific prediction models of the Fault Detection Rate (FDR) of a test set using an existing adequacy metric, thus enabling the assessment of that test set. We evaluated TEASMA with four state-of-the-art test adequacy metrics: Distance-based Surprise Coverage (DSC), Likelihood-based Surprise Coverage (LSC), Input Distribution Coverage (IDC), and Mutation Score (MS). Our extensive empirical evaluation across multiple DNN models and input sets, such as ImageNet, reveals a strong linear correlation between the predicted and actual FDR values derived from MS, DSC, and IDC, with minimum R^2 values of 0.94 for MS and 0.90 for DSC and IDC. Furthermore, when relying on regression analysis and MS, the average Root Mean Square Error (RMSE) between actual and predicted FDR values across all subjects is low at 0.09, which demonstrates that MS is more accurate than DSC and IDC, whose RMSE values are 0.17 and 0.18, respectively. Overall, these results suggest that TEASMA provides a reliable basis for confidently deciding whether to trust test results for DNN models.
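To make the evaluation measures concrete, the following is a minimal sketch of the general idea of regressing FDR on an adequacy metric and then scoring the predictions with R^2 and RMSE. It is not the TEASMA procedure itself: the data are synthetic, the linear form is an assumption made for illustration, and all names (`fit_line`, `r_squared`, `rmse`, `make_subsets`) are hypothetical. R^2 is computed here as the coefficient of determination.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def make_subsets(rng, count):
    """Synthetic (metric score, FDR) pairs; purely illustrative, not real data."""
    scores = [rng.random() for _ in range(count)]
    fdrs = [min(1.0, max(0.0, 0.1 + 0.8 * s + rng.gauss(0.0, 0.03))) for s in scores]
    return scores, fdrs


rng = random.Random(0)

# Assumption: input subsets sampled from the training set give (metric, FDR) pairs.
train_scores, train_fdrs = make_subsets(rng, 200)
slope, intercept = fit_line(train_scores, train_fdrs)

# Assumption: separate subsets play the role of test sets to be assessed.
test_scores, test_fdrs = make_subsets(rng, 50)
predicted_fdrs = [slope * s + intercept for s in test_scores]

r2_value = r_squared(test_fdrs, predicted_fdrs)
rmse_value = rmse(test_fdrs, predicted_fdrs)
print(f"R^2 = {r2_value:.3f}, RMSE = {rmse_value:.3f}")
```

Because FDR is a rate between 0 and 1, an RMSE of 0.09 corresponds to an average prediction error of about 9 percentage points, which is how the RMSE figures above can be read.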