When deploying Deep Neural Networks (DNNs), developers often convert models from one deep learning framework to another (e.g., TensorFlow to PyTorch). However, this process is error-prone and can degrade the accuracy of the target model. To quantify this impact, we perform and briefly present a differential analysis of three DNNs widely used for image recognition (MobileNetV2, ResNet101, and InceptionV3), converted across four well-known deep learning frameworks (PyTorch, Keras, TensorFlow (TF), and TFLite). The analysis revealed numerous model crashes and output label discrepancies of up to 72%. To mitigate such errors, we present a novel approach to fault localization and repair of buggy deep learning framework conversions, focusing on pre-trained image recognition models. Our technique consists of four stages of analysis, covering 1) conversion tools, 2) model parameters, 3) model hyperparameters, and 4) graph representation. In addition, we propose several strategies for repairing the detected faults. We implement our technique on top of the Apache TVM deep learning compiler and test it by conducting a preliminary fault localization analysis for the conversion of InceptionV3 from TF to TFLite. Our approach detected a fault in a common DNN converter tool, which introduced precision errors in the weights and thereby reduced model accuracy. After localizing the fault, we repaired the issue, reducing our conversion error to zero.
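As an illustration of the two measurements mentioned above (output label discrepancy and precision errors in weights), the following is a minimal sketch, not the implementation described here. It assumes that the outputs and parameters of the source and target models have already been exported as NumPy arrays; it does not use Apache TVM, and the function names, tensor names, and tolerance are hypothetical.

```python
import numpy as np


def label_discrepancy(source_outputs, target_outputs):
    """Percentage of inputs whose top-1 label differs between two models."""
    source_labels = np.argmax(source_outputs, axis=1)
    target_labels = np.argmax(target_outputs, axis=1)
    return 100.0 * float(np.mean(source_labels != target_labels))


def mismatched_parameters(source_params, target_params, atol=1e-6):
    """Map each mismatching tensor name to its maximum absolute difference.

    Tensors that are missing or have a different shape in the target
    are reported with a difference of infinity.
    """
    report = {}
    for name, source_tensor in source_params.items():
        target_tensor = target_params.get(name)
        if target_tensor is None or target_tensor.shape != source_tensor.shape:
            report[name] = float("inf")
            continue
        diff = source_tensor.astype(np.float64) - target_tensor.astype(np.float64)
        max_diff = float(np.max(np.abs(diff)))
        if max_diff > atol:
            report[name] = max_diff
    return report


# Toy data standing in for real model outputs (4 inputs, 2 classes).
source_outputs = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
target_outputs = np.array([[0.2, 0.8], [0.4, 0.6], [0.3, 0.7], [0.7, 0.3]])
discrepancy = label_discrepancy(source_outputs, target_outputs)

# Toy weights: the hypothetical converter rounds one tensor through float16.
source_params = {
    "conv1/kernel": np.array([0.1, -0.3], dtype=np.float32),
    "dense/bias": np.array([0.125], dtype=np.float32),
}
target_params = {
    "conv1/kernel": source_params["conv1/kernel"].astype(np.float16).astype(np.float32),
    "dense/bias": source_params["dense/bias"].copy(),
}
report = mismatched_parameters(source_params, target_params)

print(f"label discrepancy: {discrepancy:.1f}%")
print("tensors with precision errors:", sorted(report))
```

In this toy setup, only the tensor that was rounded through float16 is flagged, which mirrors the kind of weight precision error described above. A real analysis would compare per-layer tensors extracted from the actual source and target models.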