Surprise Adequacy (SA) has been widely studied as a test adequacy metric that can effectively guide software engineers towards inputs that are more likely to reveal unexpected behaviour of Deep Neural Networks (DNNs). Intuitively, SA is an out-of-distribution metric that quantifies the dissimilarity between a given input and the training data: if a new input is very different from those seen during training, the DNN is more likely to behave unexpectedly on it. While SA has been widely adopted as a test prioritization method, its major weakness is that computing the metric requires access to the training dataset, which is often unavailable in real-world use cases. We present DANDI, a technique that uses Stable Diffusion to generate a surrogate input distribution, allowing SA values to be computed without the original training data. An empirical evaluation of DANDI applied to image classifiers for CIFAR-10 and ImageNet-1K shows that SA values computed against synthetic data are highly correlated with those computed against the training data, with Spearman rank correlation values of 0.852 for ImageNet-1K and 0.881 for CIFAR-10. Further, we show that SA values computed by DANDI prioritize inputs as effectively as those computed using the training data when testing DNN models mutated by DeepMutation. We believe that DANDI can significantly improve the usability of SA for practical DNN testing.
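To make the intuition behind SA concrete, the following is a minimal sketch of one well-known SA variant, distance-based SA (DSA), which scores an input by comparing its activation trace against reference activations. The abstract does not fix a particular variant, so the function name, array shapes, and toy data here are illustrative assumptions; in DANDI's setting the reference activations would come from synthetic (Stable Diffusion-generated) inputs rather than the original training set.

```python
import numpy as np

def distance_based_sa(ref_acts, ref_labels, x_act, x_pred):
    """Distance-based Surprise Adequacy (DSA) sketch for a single input.

    ref_acts:   (N, D) activation traces of reference inputs
                (training data, or a surrogate distribution as in DANDI)
    ref_labels: (N,) predicted classes of those reference inputs
    x_act:      (D,) activation trace of the input under test
    x_pred:     predicted class of the input under test
    """
    same = ref_acts[ref_labels == x_pred]    # same-class references
    other = ref_acts[ref_labels != x_pred]   # other-class references
    # Distance to the nearest same-class reference activation.
    d_same = np.linalg.norm(same - x_act, axis=1)
    nearest = same[np.argmin(d_same)]
    dist_a = d_same.min()
    # Distance from that nearest point to the closest other-class activation.
    dist_b = np.linalg.norm(other - nearest, axis=1).min()
    # High ratio: the input sits far from its class cluster relative
    # to the class boundary, i.e. it is "surprising".
    return dist_a / dist_b

# Toy reference activations: class 0 clustered near the origin,
# class 1 clustered near (10, 10).
ref_acts = np.array([[0., 0.], [0., 1.], [1., 0.],
                     [10., 10.], [10., 11.], [11., 10.]])
ref_labels = np.array([0, 0, 0, 1, 1, 1])

sa_near = distance_based_sa(ref_acts, ref_labels, np.array([0.1, 0.1]), 0)
sa_far = distance_based_sa(ref_acts, ref_labels, np.array([5., 5.]), 0)
```

An in-distribution input (`sa_near`) receives a lower score than one lying between the class clusters (`sa_far`), which is exactly the ordering a test prioritization method exploits: inputs with higher SA are tested first.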