Aries: Efficient Testing of Deep Neural Networks via Labeling-Free Accuracy Estimation

Deep learning (DL) plays an increasingly important role in daily life owing to its competitive performance in industrial application domains. As the core of DL-enabled systems, deep neural networks (DNNs) must be carefully evaluated to ensure that the produced models meet the expected requirements. In practice, the \emph{de facto standard} for assessing the quality of DNNs in industry is to check their performance (accuracy) on a collected set of labeled test data. However, preparing such labeled data is often difficult, partly because data labeling is labor-intensive, especially given the massive amount of new unlabeled data arriving every day. Recent studies show that test selection for DNNs is a promising direction that tackles this issue by selecting a minimal set of representative data to label and using these data to assess the model. However, it still requires human effort and cannot be fully automated. In this paper, we propose a novel technique, named \textit{Aries}, that estimates the performance of DNNs on new unlabeled data using only the information obtained from the original test data. The key insight behind our technique is that a model should have similar prediction accuracy on data that lie at similar distances from the decision boundary. We performed a large-scale evaluation of our technique on two widely used datasets, CIFAR-10 and Tiny-ImageNet, four widely studied DNN models including ResNet101 and DenseNet121, and 13 types of data transformation methods. The results show that the accuracy estimated by \textit{Aries} is only 0.03\% -- 2.60\% off the true accuracy. Moreover, \textit{Aries} outperforms the state-of-the-art labeling-free methods in 50 out of 52 cases and selection-labeling-based methods in 96 out of 128 cases.
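
The key insight stated in the abstract can be illustrated with a minimal sketch. This is not the Aries implementation: the abstract does not say how distance to the decision boundary is measured, so the code assumes a hypothetical per-sample score in [0, 1] (for example, a confidence-style proxy) is already available. The function names `bucket_index` and `estimate_accuracy`, the equal-width bucketing, and the default of 10 buckets are all assumptions made for illustration. The sketch measures accuracy per score bucket on the labeled original test data, then transfers those per-bucket accuracies to the new unlabeled data.

```python
from typing import List, Sequence


def bucket_index(score: float, n_buckets: int) -> int:
    """Map a score in [0, 1] to one of n_buckets equal-width buckets."""
    return max(0, min(int(score * n_buckets), n_buckets - 1))


def estimate_accuracy(
    test_scores: Sequence[float],
    test_correct: Sequence[bool],
    new_scores: Sequence[float],
    n_buckets: int = 10,
) -> float:
    """Estimate accuracy on unlabeled data from labeled test data.

    test_scores / new_scores: hypothetical boundary-distance proxies in [0, 1]
    (an assumption of this sketch, not a definition taken from the paper).
    test_correct: whether the model was right on each labeled test sample.
    """
    if not test_scores or not new_scores:
        raise ValueError("both the test set and the new set must be non-empty")
    if len(test_scores) != len(test_correct):
        raise ValueError("test_scores and test_correct must have equal length")

    totals: List[int] = [0] * n_buckets
    hits: List[int] = [0] * n_buckets
    for score, correct in zip(test_scores, test_correct):
        b = bucket_index(score, n_buckets)
        totals[b] += 1
        hits[b] += int(correct)

    # Buckets with no labeled samples fall back to the overall test accuracy.
    overall = sum(hits) / sum(totals)
    bucket_acc = [
        hits[b] / totals[b] if totals[b] else overall for b in range(n_buckets)
    ]

    # Each new sample inherits the accuracy of the bucket it falls into.
    per_sample = [bucket_acc[bucket_index(s, n_buckets)] for s in new_scores]
    return sum(per_sample) / len(per_sample)


if __name__ == "__main__":
    labeled_scores = [0.05, 0.15, 0.95, 0.85]
    labeled_correct = [False, False, True, True]
    unlabeled_scores = [0.9, 0.9, 0.9, 0.1]
    print(estimate_accuracy(labeled_scores, labeled_correct, unlabeled_scores, 2))
```

In this toy run, the low-score bucket has accuracy 0 and the high-score bucket has accuracy 1 on the labeled data, so three of four new samples landing in the high-score bucket gives an estimate of 0.75. No labels for the new data are needed, which is the property the abstract emphasizes.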
