A Differential Testing Framework to Evaluate Image Recognition Model Robustness

Image recognition tasks typically use deep learning and require enormous processing power, so they rely on hardware accelerators such as GPUs and TPUs for timely processing. Real-time image recognition can fail when a model is mapped sub-optimally onto a hardware accelerator during deployment, which may lead to timing uncertainty and erroneous behavior. This mapping is carried out by multiple software components, such as deep learning frameworks, compilers, and device libraries, which we collectively refer to as the computational environment. Because image recognition is increasingly used in safety-critical applications like autonomous driving and medical imaging, it is imperative to assess the robustness of these models to changes in the computational environment: the impact of parameters such as deep learning frameworks, compiler optimizations, and hardware devices on model performance and correctness is not well understood. In this paper, we present a differential testing framework that supports deep learning model variant generation, execution, differential analysis, and testing for a number of computational environment parameters. Using our framework, we conduct an empirical robustness study of three popular image recognition models on the ImageNet dataset, assessing the impact of changing deep learning frameworks, compiler optimizations, and hardware devices. We report the impact in terms of misclassifications and inference time differences across settings. Overall, we observed output label differences of up to 72% across deep learning frameworks, and unexpected inference time degradation of up to 82% when applying compiler optimizations. Using the analysis tools in our framework, we also perform fault analysis to understand the reasons for the observed differences.
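As an illustration only, the two quantities reported above (output label differences and inference time differences between model variants) could be computed along the following lines. This is a minimal sketch rather than the framework's implementation: the function names `label_mismatch_rate` and `relative_slowdown`, the use of top-1 labels, and the choice of the baseline as the denominator are all assumptions made here for the example.

```python
from typing import Sequence


def label_mismatch_rate(reference: Sequence[int], variant: Sequence[int]) -> float:
    """Percentage of inputs whose predicted label differs between two variants.

    Assumption: each sequence holds one top-1 label per input image,
    in the same input order for both variants.
    """
    if len(reference) != len(variant):
        raise ValueError("both variants must be evaluated on the same inputs")
    if not reference:
        return 0.0
    differing = sum(1 for r, v in zip(reference, variant) if r != v)
    return 100.0 * differing / len(reference)


def relative_slowdown(baseline_ms: float, variant_ms: float) -> float:
    """Percentage change in inference time relative to the baseline.

    Positive values mean the variant is slower than the baseline.
    """
    if baseline_ms <= 0:
        raise ValueError("baseline time must be positive")
    return 100.0 * (variant_ms - baseline_ms) / baseline_ms


# Hypothetical labels from the same model run in two environments.
reference_labels = [1, 7, 3, 3]
variant_labels = [1, 7, 4, 3]
mismatch = label_mismatch_rate(reference_labels, variant_labels)

# Hypothetical timings: unoptimized build vs. a compiler-optimized build.
slowdown = relative_slowdown(baseline_ms=10.0, variant_ms=15.0)
```

In this sketch, one of the four hypothetical inputs receives a different label, and the hypothetical optimized variant takes half as long again as the baseline; a differential test would flag either value when it exceeds a chosen tolerance.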
