MutateNN: Mutation Testing of Image Recognition Models Deployed on Hardware Accelerators

The growing use of Artificial Intelligence (AI) solutions brings inherent risks, such as misclassification and sub-optimal execution time, caused by errors that faulty configuration and software bugs introduce into the deployment infrastructure. Moreover, AI methods such as Deep Neural Networks (DNNs) are used for demanding, resource-intensive, and even safety-critical tasks. To improve the performance of deployed DNN models, a variety of Machine Learning (ML) compilers have been developed, which make DNNs compatible with a range of hardware acceleration devices, such as GPUs and TPUs. The correctness of this compilation process should therefore be verified. To allow developers and researchers to explore the robustness of DNN models deployed on different hardware accelerators via ML compilers, in this paper we propose MutateNN, a tool that provides mutation testing and model analysis features in the context of deployment on different hardware accelerators. To demonstrate the capabilities of MutateNN, we focus on the image recognition domain and apply mutation testing to 7 well-established image classification models. We apply 21 mutations from 6 different categories and deploy the mutants on 4 hardware acceleration devices of varying capabilities. Our results indicate that the models are robust to changes related to layer modifications and arithmetic operators, while showing discrepancies of up to 90.3% in mutants related to conditional operators. We also observed unexpectedly severe performance degradation for mutations related to the arithmetic types of variables, which led the mutants to produce the same classification for all dataset inputs.
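The abstract does not say how a discrepancy between an original model and a mutant is computed. As a minimal sketch, assuming it is the share of dataset inputs whose top-1 label differs between the two, the comparison could look like the following. The function names `discrepancy_rate` and `is_degenerate` and the label values are hypothetical and are not taken from MutateNN.

```python
from typing import Sequence


def discrepancy_rate(original: Sequence[int], mutant: Sequence[int]) -> float:
    """Percentage of inputs whose top-1 label differs between two models.

    Assumption: this is one plausible reading of "discrepancy", not
    necessarily the metric used by MutateNN.
    """
    if len(original) != len(mutant):
        raise ValueError("prediction lists must have the same length")
    if not original:
        raise ValueError("prediction lists must not be empty")
    differing = sum(1 for o, m in zip(original, mutant) if o != m)
    return 100.0 * differing / len(original)


def is_degenerate(predictions: Sequence[int]) -> bool:
    """True when a model assigns the same label to every input."""
    return len(set(predictions)) == 1


# Hypothetical top-1 labels for five images (illustrative values only).
original_preds = [3, 7, 7, 1, 4]
mutant_preds = [3, 7, 2, 1, 0]
collapsed_preds = [5, 5, 5, 5, 5]

print(discrepancy_rate(original_preds, mutant_preds))  # 40.0
print(is_degenerate(collapsed_preds))                  # True
```

Under this reading, a mutant that collapses to a single class, as reported for the arithmetic-type mutations, would be flagged by `is_degenerate` regardless of how far its labels diverge from the original model.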
