Deep neural networks (DNNs) have demonstrated extraordinary capabilities and are now an integral part of modern software systems. However, they also suffer from various weaknesses, such as vulnerability to adversarial attacks and unfairness. Testing deep learning (DL) systems is therefore an important task for detecting and mitigating these weaknesses. Motivated by the success of traditional software testing, which often employs diversity heuristics, researchers have proposed various diversity measures on DNNs to help expose buggy behavior efficiently. In this work, we argue that many DNN testing tasks should be treated as directed testing problems rather than general-purpose testing tasks, because these tasks are specific and well-defined. Hence, diversity-based approaches are less effective for them. Following this argument, which is based on the semantics of DNNs and the testing goal, we derive six metrics that can be used for DNN testing and carefully analyze their application scopes. We empirically show their efficacy in exposing bugs in DNNs compared to recent diversity-based metrics. We also observe discrepancies between the practices of the software engineering (SE) community and those of the DL community. We point out some of these gaps in the hope of bridging SE practice and DL findings.