Simulators are widely used to test Autonomous Driving Systems (ADS), but their potential flakiness can lead to inconsistent test results. We investigate test flakiness in simulation-based testing of ADS by addressing two key questions: (1) How do flaky ADS simulations impact automated testing that relies on randomized algorithms? and (2) Can machine learning (ML) effectively identify flaky ADS tests while decreasing the required number of test reruns? Our empirical results, obtained from two widely used open-source ADS simulators and five diverse ADS test setups, show that test flakiness in ADS is common and can significantly impact the test results obtained by randomized algorithms. Further, our ML classifiers effectively identify flaky ADS tests using only a single test run, achieving F1-scores of $85$%, $82$%, and $96$% for three different ADS test setups. Our classifiers significantly outperform our non-ML baseline, which requires executing tests at least twice, by $31$%, $21$%, and $13$% in F1-score, respectively. We conclude with a discussion of the scope, implications, and limitations of our study. We provide our complete replication package in a GitHub repository.
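To make the two notions used above concrete, the following is a minimal sketch of a rerun-based flakiness check and of the F1-score used for evaluation. It is an illustration under stated assumptions, not the baseline or the classifiers from the study: the function names `is_flaky_by_rerun` and `f1_score`, the boolean pass/fail verdict, and the rule "flaky if any two runs disagree" are all hypothetical choices made for this example.

```python
from typing import Callable, Sequence


def is_flaky_by_rerun(run_test: Callable[[], bool], reruns: int = 2) -> bool:
    """Flag a test as flaky if repeated runs disagree on the pass/fail verdict.

    Assumption: `run_test` executes one simulation and returns True on pass.
    At least two runs are needed to observe any disagreement.
    """
    if reruns < 2:
        raise ValueError("a rerun-based check needs at least two runs")
    verdicts = [run_test() for _ in range(reruns)]
    return len(set(verdicts)) > 1


def f1_score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Harmonic mean of precision and recall for the positive (flaky) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A check of this kind costs at least two simulation runs per test, which is the cost a single-run classifier aims to avoid.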