Testing deep learning-based systems is crucial but challenging because labeling collected raw data takes considerable time and labor. To reduce the labeling effort, multiple test selection methods have been proposed that label only a subset of the test data while still satisfying testing requirements. However, we observe that these methods, despite their promising reported results, have been evaluated only under simple scenarios, e.g., testing on the original test data. This raises the question: are they always reliable? In this paper, we explore when and to what extent test selection methods fail. Specifically, we first identify potential pitfalls of 11 selection methods from top-tier venues based on their construction. Second, we conduct a study on five datasets with two model architectures per dataset to empirically confirm that these pitfalls exist. Furthermore, we demonstrate how the pitfalls can break the reliability of these methods. Concretely, methods for fault detection suffer from test data that are 1) correctly classified but uncertain, or 2) misclassified but confident. Remarkably, the test relative coverage achieved by such methods drops by up to 86.85%. Methods for performance estimation, on the other hand, are sensitive to the choice of intermediate-layer output: with an inappropriate layer, their effectiveness can be even worse than random selection.
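The fault-detection pitfall described above can be illustrated with a small sketch. It assumes an uncertainty-based selector that ranks inputs by the Gini impurity of their softmax output; the abstract does not name the 11 methods, so this score, the function names, and the toy probabilities are all illustrative assumptions rather than the paper's setup.

```python
from typing import List, Sequence


def gini_uncertainty(probs: Sequence[float]) -> float:
    """Gini impurity of a softmax vector: 0 when one class holds all the mass."""
    return 1.0 - sum(p * p for p in probs)


def select_most_uncertain(prob_rows: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Return indices of the `budget` inputs with the highest uncertainty score."""
    scores = [gini_uncertainty(row) for row in prob_rows]
    ranked = sorted(range(len(prob_rows)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


# Hypothetical softmax outputs for four unlabeled inputs (3 classes).
probs = [
    [0.98, 0.01, 0.01],  # confident and correct
    [0.40, 0.35, 0.25],  # uncertain but correct
    [0.97, 0.02, 0.01],  # confident yet misclassified
    [0.34, 0.33, 0.33],  # uncertain and misclassified
]
true_labels = [0, 0, 1, 2]  # hypothetical ground truth, unknown to the selector

# Predicted class = argmax of each softmax row.
predicted = [max(range(len(row)), key=lambda c: row[c]) for row in probs]
faults = {i for i, (p, t) in enumerate(zip(predicted, true_labels)) if p != t}

selected = select_most_uncertain(probs, budget=2)
found = faults & set(selected)

print("selected:", selected)
print("faults:", sorted(faults))
print("faults found:", sorted(found))
```

In this toy case the selector spends half of its labeling budget on input 1 (correct but uncertain) and never surfaces input 2 (misclassified but confident), which mirrors the two kinds of test data the abstract says fault-detection methods suffer from.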