Deep Neural Networks~(DNNs) have been widely deployed in software to address various tasks~(e.g., autonomous driving, medical diagnosis). However, they can also behave incorrectly, causing financial losses and even threatening human safety. To reveal and repair incorrect behaviors in DNNs, developers often collect large unlabeled datasets from the natural world and label them to test the DNN models. However, properly labeling a large amount of unlabeled data is expensive and time-consuming. To address this problem, we propose NSS, Neuron Sensitivity guided test case Selection, which reduces labeling time by selecting valuable test cases from unlabeled datasets. NSS leverages the information that test cases induce in the model's internal neurons to select valuable test cases, i.e., those with a high likelihood of causing the model to behave incorrectly. We evaluate NSS on four widely used datasets and four well-designed DNN models, comparing it against state-of-the-art~(SOTA) baseline methods. The results show that NSS performs well in assessing both the fault-triggering probability of test cases and their ability to improve the model. Specifically, compared with the baseline approaches, NSS obtains a higher fault detection rate~(e.g., when selecting 5\% of the test cases from the unlabeled dataset in the MNIST \& LeNet1 experiment, NSS obtains a fault detection rate of 81.8\%, 20\% higher than the baselines).
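The selection idea summarized above can be illustrated with a minimal sketch. It is not the NSS implementation: the scoring rule (mean absolute change in hidden-layer activations between an input and a slightly perturbed copy), the toy one-layer network, and the definition of fault detection rate as the fraction of selected test cases that are misclassified are all assumptions made for illustration, and every function name is hypothetical.

```python
import numpy as np

def hidden_activations(x, w, b):
    """ReLU activations of one hidden layer (toy stand-in for a DNN's internal neurons)."""
    return np.maximum(0.0, x @ w + b)

def sensitivity_scores(x, x_perturbed, w, b):
    """Assumed proxy for neuron sensitivity: mean absolute change in
    activations between each input and its perturbed copy."""
    delta = hidden_activations(x_perturbed, w, b) - hidden_activations(x, w, b)
    return np.abs(delta).mean(axis=1)

def select_top_fraction(scores, fraction):
    """Indices of the highest-scoring test cases (at least one is returned)."""
    k = max(1, int(round(fraction * len(scores))))
    # Stable sort on negated scores: descending order, ties keep input order.
    return np.argsort(-scores, kind="stable")[:k]

def fault_detection_rate(selected, is_fault):
    """Assumed definition: fraction of selected test cases the model gets wrong."""
    return float(np.mean(is_fault[selected]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(100, 8))                      # unlabeled test inputs
    x_perturbed = x + 0.01 * rng.normal(size=x.shape)  # small input perturbation
    w, b = rng.normal(size=(8, 16)), np.zeros(16)      # toy hidden layer
    scores = sensitivity_scores(x, x_perturbed, w, b)
    selected = select_top_fraction(scores, 0.05)       # label only 5% of the data
    print("indices to label:", selected)
```

Only the selected indices would then be sent for labeling; once labels are available, `fault_detection_rate` measures how many of the selected cases actually expose a misclassification.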