Black-Box Testing of Deep Neural Networks Through Test Case Diversity

Deep Neural Networks (DNNs) have been extensively used in many areas including image processing, medical diagnostics, and autonomous driving. However, DNNs can exhibit erroneous behaviours that may lead to critical errors, especially when used in safety-critical systems. Inspired by testing techniques for traditional software systems, researchers have proposed neuron coverage criteria, as an analogy to source code coverage, to guide the testing of DNN models. Despite very active research on DNN coverage, several recent studies have questioned the usefulness of such criteria in guiding DNN testing. Further, from a practical standpoint, these criteria are white-box: they require access to the internals or training data of DNN models, which in many contexts is not feasible or convenient. In this paper, we investigate black-box input diversity metrics as an alternative to white-box coverage criteria. To this end, we first select and adapt three diversity metrics and study, in a controlled manner, their capacity to measure actual diversity in input sets. We then analyse their statistical association with fault detection using four datasets and five DNN models. We further compare diversity with state-of-the-art white-box coverage criteria. Our experiments show that the diversity of image features embedded in test input sets is a more reliable indicator than coverage criteria for guiding the testing of DNNs. Indeed, we found that one of our selected black-box diversity metrics far outperforms existing coverage criteria in terms of fault-revealing capability and computational time. The results also confirm suspicions that state-of-the-art coverage metrics are not adequate for guiding the construction of test input sets that detect as many faults as possible with natural inputs.
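To make the idea of a black-box diversity score over image features concrete, the following is a minimal sketch rather than the implementation evaluated in this paper. The abstract does not name the three metrics, so the score used here, the log-determinant of the Gram matrix of the feature vectors, is an illustrative assumption, as is the function name `log_det_diversity`. The feature vectors are assumed to have been extracted from the test images beforehand by some external feature extractor, with no access to the DNN under test.

```python
import numpy as np


def log_det_diversity(features: np.ndarray) -> float:
    """Illustrative diversity score for a set of test inputs (an assumption,
    not necessarily one of the metrics studied in the paper).

    features: array of shape (n_inputs, n_features), one row per test input,
              assumed to be precomputed image features.
    Returns the log-determinant of the Gram matrix of the row-normalised
    feature vectors. Higher values mean the rows span a larger volume in
    feature space. Returns -inf when the Gram matrix is singular.
    """
    features = np.asarray(features, dtype=float)
    # Normalise each row so the score reflects direction, not magnitude.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.where(norms == 0, 1.0, norms)
    # Gram matrix of pairwise similarities between test inputs.
    gram = unit @ unit.T
    # slogdet avoids overflow/underflow of the raw determinant.
    sign, logdet = np.linalg.slogdet(gram)
    return float(logdet) if sign > 0 else float("-inf")


# Hypothetical feature sets: mutually orthogonal rows vs. nearly parallel rows.
diverse_set = np.eye(3)
similar_set = np.array([[1.0, 0.0, 0.0],
                        [0.99, 0.1, 0.0],
                        [0.98, 0.0, 0.1]])
print(log_det_diversity(diverse_set), log_det_diversity(similar_set))
```

Under these assumptions, a test set whose feature vectors point in very different directions scores higher than one whose vectors are nearly parallel, and the score is computed without any access to the model's internals or training data.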
