Search-based software testing (SBST) is a widely adopted technique for testing complex systems with large input spaces, such as Deep Learning-enabled (DL-enabled) systems. Many SBST techniques focus on Pareto-based optimization, where multiple objectives are optimized in parallel to reveal failures. However, it is important to ensure that identified failures are spread throughout the entire failure-inducing area of a search domain and not clustered in a sub-region. This ensures that identified failures are semantically diverse and reveal a wide range of underlying causes. In this paper, we present a theoretical argument explaining why testing based on Pareto optimization is inadequate for covering failure-inducing areas within a search domain. We support our argument with empirical results obtained by applying two widely used types of Pareto-based optimization techniques, namely NSGA-II (an evolutionary algorithm) and MOPSO (a swarm-based algorithm), to two DL-enabled systems: an industrial Automated Valet Parking (AVP) system and a system for classifying handwritten digits. We measure the coverage of failure-revealing test inputs in the input space using a metric that we refer to as the Coverage Inverted Distance quality indicator. Our results show that NSGA-II and MOPSO are not more effective than a naïve random search baseline in covering test inputs that reveal failures. The replication package for this study is available in a GitHub repository.
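The abstract names the Coverage Inverted Distance quality indicator without defining it. As a rough illustration only, the sketch below assumes it behaves like an inverted generational distance: the average Euclidean distance from each point of a reference set of failing inputs to the nearest failing input found by the search, so that lower values indicate better coverage. The function name, the choice of Euclidean distance, and the toy points are assumptions for illustration and are not taken from the paper.

```python
import math


def coverage_inverted_distance(reference_failures, found_failures):
    """Average distance from each reference failing input to the nearest
    failing input found by a search (assumed, IGD-style definition)."""
    if not reference_failures:
        raise ValueError("reference set must not be empty")
    if not found_failures:
        # Nothing found: treat the failure region as entirely uncovered.
        return math.inf
    total = 0.0
    for ref in reference_failures:
        # Distance from this reference point to its closest found failure.
        total += min(math.dist(ref, found) for found in found_failures)
    return total / len(reference_failures)


# Hypothetical 2D failure region approximated by four reference points.
reference = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
clustered = [(0.0, 0.0), (0.1, 0.0)]  # failures found in one corner only
spread = list(reference)              # failures found across the region

cid_clustered = coverage_inverted_distance(reference, clustered)
cid_spread = coverage_inverted_distance(reference, spread)
```

Under this assumed definition, a search that finds failures in only one corner of the failure-inducing area scores worse (a larger value) than one whose failures are spread across it, which is the property the abstract's coverage argument relies on.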